High Availability Design
When architecting solutions on Octos Cloud, high availability (HA) ensures your services remain accessible even in the event of component failures. A robust design typically employs orchestrators like Docker Swarm or Kubernetes.
Core HA Principles
Services should be highly available wherever possible. This means eliminating single points of failure at every layer of the stack.
- Manager Nodes Quorum: Run an odd number of manager nodes (typically 3 or 5) to maintain cluster quorum. Quorum requires a majority of managers, so a 3-manager cluster tolerates the loss of one manager and a 5-manager cluster tolerates the loss of two.
- Worker Node Replication: Deploy service replicas across multiple worker nodes. If a worker node fails, the orchestrator automatically re-schedules its tasks to healthy nodes.
- Ingress and Routing: Use a modern reverse proxy like Traefik deployed on manager nodes to route traffic to healthy service endpoints dynamically.
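The quorum arithmetic behind the manager-count recommendation can be sketched in a few lines of Python. This is an illustration only, assuming a majority-based consensus among managers (as used by Docker Swarm); the function names are hypothetical and not part of any Octos Cloud or orchestrator API.

```python
def quorum_size(managers: int) -> int:
    """Smallest majority of `managers` voting members."""
    return managers // 2 + 1


def failures_tolerated(managers: int) -> int:
    """Number of managers that can be lost while a majority remains."""
    return managers - quorum_size(managers)


if __name__ == "__main__":
    # Compare cluster sizes: an even count adds no tolerance over the
    # odd count just below it.
    for n in range(1, 8):
        print(
            f"{n} managers: quorum={quorum_size(n)}, "
            f"tolerates {failures_tolerated(n)} failure(s)"
        )
```

Under this assumption, a fourth manager tolerates no more failures than three do, which is why odd counts are preferred.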
Handling Failure Scenarios
Normal Function
Under normal operation, traffic flows through load balancers or edge routers (like Traefik) and is distributed across multiple backend replicas running on your Instances.
Node Failure
If an Instance fails (e.g., hardware fault or OS panic):
- The orchestrator detects the missed heartbeats.
- Containers on the failed node are marked as down.
- New container instances are automatically spun up on remaining healthy nodes.
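The steps above can be modelled with a small simulation. This is a simplified sketch, not how any particular orchestrator is implemented: it assumes tasks from the failed node are recreated one at a time on whichever healthy node currently runs the fewest tasks, and the node and task names are made up for illustration.

```python
from typing import Dict, List


def reschedule(placement: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new placement with the failed node's tasks moved to healthy nodes."""
    # Copy the task lists of every node except the failed one.
    healthy = {node: list(tasks) for node, tasks in placement.items() if node != failed}
    if not healthy:
        raise RuntimeError("no healthy nodes left to schedule on")
    for task in placement.get(failed, []):
        # Pick the least-loaded healthy node; ties are broken by node name.
        target = min(healthy, key=lambda n: (len(healthy[n]), n))
        healthy[target].append(task)
    return healthy


if __name__ == "__main__":
    before = {
        "worker-1": ["web.1", "web.4"],
        "worker-2": ["web.2"],
        "worker-3": ["web.3"],
    }
    print(reschedule(before, failed="worker-1"))
```

In a real cluster the replacement containers are new instances rather than the original ones moved over, so anything held only in a container's local state is lost.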
Node Restore
When the failed node recovers and rejoins the cluster:
- It becomes available for scheduling again.
- New tasks and service updates can be scheduled on this node, so the workload rebalances gradually. Tasks already running on other nodes are generally not moved back automatically.
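To illustrate how a recovered node picks up work again, the sketch below assumes a spread-style placement policy in which each new task goes to the node currently running the fewest tasks. The policy, the function name, and the node names are assumptions for illustration; tasks already running elsewhere are left untouched in this model.

```python
from typing import Dict


def place_new_tasks(load: Dict[str, int], new_tasks: int) -> Dict[str, int]:
    """Assign new tasks one by one to the currently least-loaded node."""
    result = dict(load)  # do not mutate the caller's mapping
    for _ in range(new_tasks):
        # Ties are broken by node name to keep the outcome deterministic.
        target = min(result, key=lambda n: (result[n], n))
        result[target] += 1
    return result


if __name__ == "__main__":
    # worker-1 has just rejoined and is idle; the others carry the load.
    load = {"worker-1": 0, "worker-2": 3, "worker-3": 3}
    print(place_new_tasks(load, new_tasks=2))
```

In this model the idle node absorbs new tasks until its load matches the others, while existing tasks stay where they are.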
Total Cluster Failure
In the rare event that the cluster loses manager quorum (a majority of manager nodes is unavailable), worker nodes continue to run their existing containers, but no new deployments or scheduling changes can occur until quorum is restored. Regular automated backups and snapshots are essential for disaster recovery in this scenario.
Best Practices
- Place Instances across different physical hypervisors if supported.
- Use replicated persistent storage (like Ceph), or pin stateful workloads to specific nodes and back them up regularly.
- Ensure your Security Group rules allow internal cluster communication across all nodes in the Virtual Network.
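As a rough aid for the Security Group check, the sketch below assumes a Docker Swarm cluster, which uses TCP 2377 for cluster management, TCP and UDP 7946 for node-to-node communication, and UDP 4789 for overlay network traffic. The rule format (dictionaries with `protocol`, `port_min`, `port_max`) is hypothetical and does not reflect the Octos Cloud API; Kubernetes clusters need a different port list.

```python
from typing import Dict, Iterable, List, Tuple, Union

Rule = Dict[str, Union[str, int]]

# Ports Docker Swarm needs open between nodes (assumption: Swarm mode).
REQUIRED_SWARM_PORTS = {
    ("tcp", 2377),  # cluster management
    ("tcp", 7946),  # node-to-node communication
    ("udp", 7946),  # node-to-node communication
    ("udp", 4789),  # overlay network (VXLAN) traffic
}


def missing_rules(rules: Iterable[Rule]) -> List[Tuple[str, int]]:
    """Return the required (protocol, port) pairs not covered by any rule."""
    rules = list(rules)
    missing = []
    for proto, port in sorted(REQUIRED_SWARM_PORTS):
        covered = any(
            r["protocol"] == proto and r["port_min"] <= port <= r["port_max"]
            for r in rules
        )
        if not covered:
            missing.append((proto, port))
    return missing


if __name__ == "__main__":
    # Hypothetical rule set that forgets the overlay network port.
    rules = [
        {"protocol": "tcp", "port_min": 2377, "port_max": 2377},
        {"protocol": "tcp", "port_min": 7946, "port_max": 7946},
        {"protocol": "udp", "port_min": 7946, "port_max": 7946},
    ]
    print(missing_rules(rules))
```

A rule set that passes this check still needs to be scoped to the cluster's Virtual Network rather than opened to the public internet.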