Kubernetes Node Retirement

When AWS Systems Manager marks a node for retirement, companies must gracefully terminate work on that node.


The problem

From time to time, AWS needs to repair or upgrade a server in its network. If you are running on that server, AWS Systems Manager sends an alert letting you know the node has been marked for retirement. Because this happens infrequently, many companies have not designed a way to gracefully terminate their work on the node. As a result, AWS is often forced to take the server offline, abruptly killing any services running on it, which can lead to data loss and customer downtime.

The solution

Shoreline makes it easy to cleanly handle nodes marked for retirement. First, a pre-built Shoreline alarm triggers whenever AWS EC2 marks a node for retirement. Shoreline then automates cordoning, draining, and terminating the node, which in turn triggers Kubernetes to automatically spin up a replacement node. This approach ensures that all services running on the node are gracefully terminated without interrupting any transactions.

Sometimes a node marked for retirement gets stuck partway through the retirement process. When this happens, Kubernetes may still consider the node online and will not spin up a replacement. In this case, Shoreline terminates or restarts the node, which at least ensures that the right capacity is available for all applications.
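For readers who want to see what this flow looks like without Shoreline, the following is a minimal Python sketch of the same steps, not Shoreline's actual implementation. It assumes retirement is surfaced as an EC2 scheduled event with the code instance-retirement in the output of `aws ec2 describe-instance-status`, and that `kubectl` and the AWS CLI are installed and authenticated. The function names, the injectable `run` parameter, and the 300s drain timeout are hypothetical choices made for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes a command (list of arguments) and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def retiring_instance_ids(status: dict) -> List[str]:
    """Return IDs of instances that have a scheduled 'instance-retirement' event.

    `status` is assumed to be the parsed JSON output of
    `aws ec2 describe-instance-status`.
    """
    ids = []
    for item in status.get("InstanceStatuses", []):
        codes = {event.get("Code") for event in item.get("Events", [])}
        if "instance-retirement" in codes:
            ids.append(item["InstanceId"])
    return ids


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def retire_node(node: str, instance_id: str, run: Runner = run_command,
                drain_timeout: str = "300s") -> bool:
    """Cordon, drain, then terminate a node. Returns True if the drain was clean."""
    # Cordon: stop the scheduler from placing new pods on the node.
    if run(["kubectl", "cordon", node]) != 0:
        raise RuntimeError(f"could not cordon {node}")

    # Drain: evict pods so they can shut down gracefully and be rescheduled.
    drained = run([
        "kubectl", "drain", node,
        "--ignore-daemonsets", "--delete-emptydir-data",
        f"--timeout={drain_timeout}",
    ]) == 0

    # Terminate even if the drain got stuck, so that capacity can be replaced.
    if run(["aws", "ec2", "terminate-instances",
            "--instance-ids", instance_id]) != 0:
        raise RuntimeError(f"could not terminate {instance_id}")
    return drained
```

Passing a fake `run` function makes the sequence testable without touching a real cluster. Terminating after a failed drain mirrors the stuck-node case described above: the shutdown is no longer fully graceful, but replacement capacity is not blocked on a node that will never finish draining.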

Highlights

Customer experience impact: High (potential hours of downtime)
Occurrence frequency: Medium (monthly for fleets with many nodes)
Shoreline time to repair: Low (0 minutes to repair)
Time to diagnose manually
Security
Cost impact
Time to repair manually: High (2-4 manual hours to repair)
