Introduction
Docker Swarm is a container orchestrator that lets you deploy and manage services across multiple hosts, and its main appeal over Kubernetes is simplicity. Despite that simplicity, it offers the features production environments need, and chief among them is high availability: a highly available cluster keeps services running even when individual nodes fail.
In Swarm, high availability hinges on the manager nodes. Managers maintain the cluster state and coordinate tasks across workers, so a single-manager cluster is a single point of failure: if that manager dies, the whole cluster can become non-functional. Running multiple managers, an odd number such as three or five, lets the cluster keep quorum and continue making decisions when a manager goes down. Swarm synchronizes state among managers with the Raft consensus algorithm, and each manager keeps a full copy of the Raft log, so the surviving managers carry on without disruption. Worker nodes simply run services; if one fails, Swarm automatically reschedules its tasks onto healthy nodes.
High availability also applies to the services themselves. Running multiple replicas spreads containers across nodes so no single host failure takes a critical service offline. The overlay network lets services communicate seamlessly across hosts, and the built-in routing mesh load-balances incoming traffic to healthy instances. Security matters too: node-to-node traffic is encrypted with TLS by default, and secrets and config management keep sensitive data out of images and environment variables. Rolling updates let you ship new versions without downtime.
For day-two operations, monitoring and logging help you catch issues before they affect users; tools like Prometheus, Grafana, and Portainer integrate easily with Swarm, and backing up the Raft data directory covers disaster recovery. Kubernetes may be more popular, but Swarm's lightweight design and ease of use keep it relevant for small teams and edge deployments.
In this blog, we'll build a highly available Swarm cluster step by step: setting up managers and workers, deploying a service, and testing the cluster's resilience. By the end, you'll see that high availability doesn't require a complex orchestrator; Swarm strikes a practical balance of simplicity and robustness for modern deployments.

What Does “Highly Available” Mean in Docker Swarm?
In Docker Swarm, availability mainly depends on your manager nodes: they maintain the cluster state and handle orchestration.
- If you have one manager, and it fails, your cluster is down.
- If you have multiple managers, the cluster can survive failures as long as a majority (quorum) remains available.
So, for high availability:
Rule of Thumb: Run an odd number of manager nodes, typically 3 or 5, to maintain quorum.
Worker nodes, on the other hand, just run services. If one fails, Docker Swarm automatically reschedules containers elsewhere.
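Quorum here means a simple majority: a cluster of N managers needs floor(N/2) + 1 of them reachable to make decisions, so it tolerates the loss of floor((N-1)/2). A quick comparison shows why even counts buy you nothing:
Managers   Quorum   Failures tolerated
1          1        0
3          2        1
4          3        1
5          3        2
Four managers tolerate exactly as many failures as three, while adding one more node that has to stay in sync.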
Architecture Overview
Let’s imagine we’re building a 3-manager, 2-worker cluster:
[ Manager 1 ] —\
[ Manager 2 ] ——> Form a Swarm cluster (quorum)
[ Manager 3 ] —/
[ Worker 1 ] → Runs containers
[ Worker 2 ] → Runs containers
Each manager has a full copy of the Raft log, ensuring that even if one or two managers fail, the cluster can still function.
Prerequisites
You’ll need:
- 5 Linux machines (VMs or cloud instances)
- Docker Engine (v20+)
- SSH access
- Basic familiarity with Docker CLI
Example setup:
manager1: 192.168.1.10
manager2: 192.168.1.11
manager3: 192.168.1.12
worker1: 192.168.1.13
worker2: 192.168.1.14
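Also make sure the nodes can reach each other on the ports Swarm uses: 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node communication), and 4789/udp (overlay network traffic). A minimal sketch assuming ufw as the firewall; adapt it to whatever tooling you use:
sudo ufw allow 2377/tcp    # cluster management
sudo ufw allow 7946/tcp    # node communication
sudo ufw allow 7946/udp    # node communication
sudo ufw allow 4789/udp    # overlay network (VXLAN)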
Step-by-Step: Setting Up the HA Swarm
1. Initialize the Swarm
On manager1, initialize the Swarm and advertise its IP:
docker swarm init --advertise-addr 192.168.1.10
Docker will output two tokens:
- One for joining managers
- One for joining workers
You can get them again later with:
docker swarm join-token manager
docker swarm join-token worker
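Treat these tokens like credentials: anyone who has one can join your cluster. If a token leaks, you can rotate it without affecting nodes that already joined:
docker swarm join-token --rotate worker
docker swarm join-token --rotate manager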
2. Add Manager Nodes
On manager2 and manager3, join the Swarm using the manager token:
docker swarm join --token <manager-token> 192.168.1.10:2377
Now verify from manager1:
docker node ls
You should see three managers:
ID            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
z7t1...dlf7   manager1   Ready    Active         Leader
9hfe...2kji   manager2   Ready    Active         Reachable
l1px...0abc   manager3   Ready    Active         Reachable
3. Add Worker Nodes
On worker1 and worker2, join using the worker token:
docker swarm join --token <worker-token> 192.168.1.10:2377
They’ll appear as workers when you run docker node ls again.
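Roles aren't set in stone, either. From any manager you can promote a worker (or demote a manager) later, which is handy when replacing manager hardware:
docker node promote worker1
docker node demote manager3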
4. Test High Availability
Now, stop Docker on one manager:
sudo systemctl stop docker
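On systemd-based distributions Docker is usually socket-activated, so it may pop back up on the next API call; if that happens during your test, stop the socket unit as well (unit names can vary by installation):
sudo systemctl stop docker.service docker.socket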
Run docker node ls again from another manager: you'll see the failed node as Down, but the cluster continues to operate because you still have quorum.
If you bring down two managers, the Swarm loses quorum and becomes read-only (no new tasks or updates). That’s why three or five managers is the sweet spot.
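If you do lose quorum and can't recover the failed managers, Docker provides a last-resort escape hatch: re-initialize the Swarm from a surviving manager. Run this on one node only; it creates a fresh single-manager cluster that keeps your existing services and data, after which you re-join the other nodes:
docker swarm init --force-new-cluster --advertise-addr 192.168.1.11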
Deploy a Test Service
To verify everything works:
docker service create --name web --replicas 5 -p 80:80 nginx
Now run:
docker service ps web
You’ll see containers distributed across worker nodes (and possibly managers, depending on constraints).
Swarm automatically reschedules replicas from a failed node onto the remaining healthy ones. Note that it won't spread tasks back out when the node recovers; forcing a service update redistributes them, as shown below.
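Because replicas are just desired state, scaling and zero-downtime rolling updates are one-liners. A quick sketch (the nginx:1.25 tag is only an example):
docker service scale web=10
docker service update --image nginx:1.25 --update-parallelism 1 --update-delay 10s web
docker service update --force web    # redistribute tasks after a node rejoins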
Hardening for Production
A few best practices before you go live:
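- Keep an odd number of managers (three or five) and spread them across failure domains.
- Drain your managers so application tasks don't compete with Raft and orchestration work: docker node update --availability drain manager1
- Back up the Raft data directory (/var/lib/docker/swarm) on a manager for disaster recovery.
- Store sensitive data with Swarm secrets rather than environment variables, e.g. docker secret create db_password ./password.txt
- Monitor cluster and service health with tools like Prometheus, Grafana, or Portainer.
- Restrict access to the manager port (2377) to cluster nodes and rotate join tokens periodically.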

Wrapping Up
Docker Swarm might not be the new kid on the block anymore, but it still provides a simple, powerful way to run containers resiliently.
With a 3-manager setup, your cluster can handle node failures gracefully, automatically reschedule containers, and keep your apps online, all without the complexity of Kubernetes.
