Part 1 - Kubernetes Control Plane and Loss of Service
Summary
Standard Kubernetes clusters can leave you exposed to loss of service and broken clusters!
I thought Kubernetes is self-healing and can’t fail.
Yes and no. The Kubernetes Control Plane is made highly available by running it across multiple servers. However, its components, binaries and configuration database are not immune to security breaches, failed upgrades or misconfiguration.
A Control Plane component failure can leave your cluster broken with potential loss of service or even data loss.
A failed Kubernetes upgrade, misconfiguration or security breach can break your cluster, causing downtime or even data loss!
If that’s the case, why do we even need a Control Plane in the first place?
The Control Plane
The Control Plane is an integral, non-optional component of every Kubernetes platform and is rather a good thing.
The Control Plane is the Kubernetes magic ingredient that groups together a number of servers, up to 5,000 servers or “nodes”, for administration via a single API endpoint. No more SSH sessions into individual servers.
Configuration drift, duplication of effort and complexity are greatly reduced, removing the burden of managing a fleet of servers individually.
For more details, see the documentation on Kubernetes Control-Plane Components.

Figure 1 – Multiple Single Clusters
A Control Plane is how Kubernetes centralizes all server and app configurations, reducing duplication and allowing a group of servers to act as a single cluster.
Note: While a fault on any single cluster node will not affect the cluster as a whole, any issue with the Control Plane will affect every node in the cluster.
Managed Kubernetes Services
Cloud Service Provider (CSP) Kubernetes distributions—such as Amazon EKS, Azure AKS and Google GKE—don’t help with this problem much either.
While the cloud provider will take care of making your Control Plane highly available for you, this won’t make your applications any more resilient to cluster-level failures should they occur.
You’ll still need to contend with the fact that a single cluster equates to a single Control Plane, which may not provide enough availability for mission-critical applications.
Service Risks
While your applications are likely to continue running during a cluster level fault, advanced Kubernetes functions such as self-healing, application rollout and rollback capabilities can fail.
The following Kubernetes functions can be affected:
Pod scheduling
Pod fault detection & recovery
Application rollouts/rollbacks
Cluster configuration
Cluster upgrades
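The effect described above can be illustrated with a deliberately simplified toy model. This is only a sketch: `ToyCluster`, `pod_crashes` and `reconcile` are hypothetical names invented for illustration and are not part of any Kubernetes API. It shows that pods which are already running keep serving during a Control Plane outage, while a pod that fails during the outage is not replaced until the Control Plane recovers.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical toy model of a cluster; not a real Kubernetes API."""
    desired_replicas: int
    running_pods: int
    control_plane_up: bool = True

    def pod_crashes(self) -> None:
        # A pod can fail regardless of Control Plane health.
        self.running_pods = max(0, self.running_pods - 1)

    def reconcile(self) -> bool:
        """Replace missing pods, but only while the Control Plane is reachable."""
        if not self.control_plane_up:
            return False  # no scheduling and no self-healing during the outage
        self.running_pods = self.desired_replicas
        return True


cluster = ToyCluster(desired_replicas=3, running_pods=3)

cluster.control_plane_up = False      # simulate a cluster-level fault
cluster.pod_crashes()
cluster.reconcile()                   # has no effect while the Control Plane is down
during_outage = cluster.running_pods  # surviving pods keep running, the lost one is not replaced

cluster.control_plane_up = True       # Control Plane recovers
cluster.reconcile()
after_recovery = cluster.running_pods

print(during_outage, after_recovery)
```

In this model the application degrades from three pods to two during the outage and only returns to three once reconciliation is possible again.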
Cluster Upgrades
Successful Cluster Upgrade
Achieving a successful Kubernetes cluster upgrade demands a substantial amount of time and carries inherent risks to service availability.
What’s more, if you’ve custom-built your Kubernetes clusters—and there are valid reasons for doing so—you’re on your own when it comes to performing cluster upgrades.
Node Pool Cluster Upgrades
Fortunately, Cloud Service Providers make this arduous process easier with managed node pools (such as Amazon EKS managed node groups, Azure AKS node pools and Google GKE node pools). These services automate security patching and upgrades for both your control plane and worker node servers. During a node pool upgrade, each node is automatically cordoned and its pods are drained before the node is upgraded, so that your applications remain available throughout the process.
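As a rough illustration of the cordon-and-drain sequence, here is a minimal simulation. It is a sketch under stated assumptions rather than how any provider actually implements upgrades: `rolling_upgrade` is a hypothetical function, nodes are modelled as a plain dictionary of pod names, and rescheduling simply picks the least-loaded remaining node.

```python
from typing import Dict, List


def rolling_upgrade(nodes: Dict[str, List[str]], new_version: str) -> Dict[str, str]:
    """Upgrade nodes one at a time; `nodes` maps node name -> list of pod names.

    Hypothetical model: the `nodes` dictionary is modified in place.
    """
    versions: Dict[str, str] = {}
    for name in list(nodes):
        # Cordon: the node stops accepting new pods (modelled by excluding it as a target).
        targets = [n for n in nodes if n != name]
        if not targets:
            raise RuntimeError("no other node available to receive pods")
        # Drain: evict each pod and reschedule it onto the least-loaded remaining node.
        for pod in nodes[name]:
            destination = min(targets, key=lambda n: len(nodes[n]))
            nodes[destination].append(pod)
        nodes[name] = []
        # Upgrade the now-empty node. It is implicitly uncordoned, because it
        # becomes a valid target again on the next loop iteration.
        versions[name] = new_version
    return versions


pool = {"node-a": ["web-1", "web-2"], "node-b": ["api-1"], "node-c": []}
result = rolling_upgrade(pool, "v-next")  # "v-next" is a placeholder version label
print(result)
print(pool)
```

Every pod always has a node to run on because only one node is out of service at a time, which is the property that keeps applications available during the upgrade.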
Maintenance Windows
Managed node pools strive to perform cluster upgrades and security patching while maintaining complete service availability. However, maintenance windows are still advised, since problems can arise from time to time.
Consult your Cloud Service Provider documentation on how to configure a maintenance schedule.
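Each provider has its own way of expressing a maintenance schedule, so the following is only a provider-neutral sketch of the underlying idea. The function `in_maintenance_window` and its parameters are hypothetical. It assumes a weekly window expressed in UTC that does not cross midnight, and it expects a timezone-aware `datetime`.

```python
from datetime import datetime, time, timezone


def in_maintenance_window(now: datetime, weekday: int, start: time, end: time) -> bool:
    """Return True if `now` falls on `weekday` (Monday=0) between start and end, in UTC.

    Hypothetical helper: windows that cross midnight are not handled.
    """
    now_utc = now.astimezone(timezone.utc)
    return now_utc.weekday() == weekday and start <= now_utc.time() < end


SUNDAY = 6
window_start, window_end = time(2, 0), time(4, 0)  # example window: Sundays 02:00-04:00 UTC

inside = in_maintenance_window(
    datetime(2024, 6, 2, 3, 0, tzinfo=timezone.utc), SUNDAY, window_start, window_end
)
outside = in_maintenance_window(
    datetime(2024, 6, 2, 5, 0, tzinfo=timezone.utc), SUNDAY, window_start, window_end
)
print(inside, outside)
```

A check like this could gate your own upgrade automation so that disruptive steps only start inside the agreed window.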
Mission Critical Cluster Upgrades
Managed node pools execute in situ upgrades that require maintenance windows. This may pose an unacceptable level of service disruption for mission-critical applications, in which case you may want to opt for Blue/Green cluster upgrades instead. While more expensive, due to duplicated system resources and more coordinated effort, they offer the best service availability, with minimal to zero downtime.
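The core decision in a Blue/Green cluster upgrade can be sketched in a few lines. This is an illustrative assumption about how a cutover might be gated, not a prescribed procedure: `blue_green_cutover` and the cluster names are hypothetical, and `smoke_test` stands in for whatever health checks you run against the new cluster.

```python
from typing import Callable


def blue_green_cutover(active: str, candidate: str, smoke_test: Callable[[str], bool]) -> str:
    """Return the cluster that should receive traffic after a cutover attempt."""
    # Switch only when the candidate (green) cluster passes its checks;
    # otherwise traffic stays on the currently active (blue) cluster.
    return candidate if smoke_test(candidate) else active


# Hypothetical cluster names; the lambdas stand in for real health checks.
healthy_cutover = blue_green_cutover("blue-cluster", "green-cluster", lambda c: True)
failed_cutover = blue_green_cutover("blue-cluster", "green-cluster", lambda c: False)
print(healthy_cutover, failed_cutover)
```

Because the old cluster stays untouched until the new one is verified, a failed upgrade costs nothing more than leaving traffic where it already was.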
Kubernetes API Server Deprecation
Newer versions of Kubernetes often phase out certain APIs and features, potentially causing issues for running applications. Test the behavior of your applications against a new Kubernetes version before you update your production clusters.
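One way to catch removed APIs before an upgrade is to scan your manifests for API versions that the target release no longer serves. The sketch below assumes manifests have already been parsed into dictionaries. `find_removed_apis` is a hypothetical helper, and the `REMOVED_IN` table is a small, deliberately incomplete sample, so check the official deprecation guide for the full list.

```python
from typing import Dict, List, Tuple

# (apiVersion, kind) -> first Kubernetes version that no longer serves it.
# Partial sample only; not an exhaustive list.
REMOVED_IN: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
}


def find_removed_apis(manifests: List[dict], target: Tuple[int, int]) -> List[str]:
    """List manifests whose apiVersion/kind is no longer served in `target`."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion", ""), manifest.get("kind", ""))
        removed = REMOVED_IN.get(key)
        if removed is not None and target >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{key[1]}/{name} uses {key[0]}, removed in {removed[0]}.{removed[1]}"
            )
    return findings


# Hypothetical example manifests, already parsed from YAML.
example_manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly-report"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
for finding in find_removed_apis(example_manifests, target=(1, 25)):
    print(finding)
```

Running a check like this in a pipeline before each upgrade gives early warning, but it does not replace testing the applications themselves against the new version.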
Conclusion
In conclusion, while single clusters offer simplicity and ease of management, they may suffer control plane outages and do not adequately address requirements such as data sovereignty, global distribution, or mission-critical application support. On the other hand, multi-cluster architectures provide solutions to these challenges but introduce complexities that need to be carefully considered.
Lastly, while cloud provider managed node pools will patch and upgrade your Kubernetes clusters with minimal downtime, you’ll need to regularly test applications against deprecated API Server resources and schedule maintenance windows.
Mission-critical applications may require an entirely bespoke upgrade strategy such as Blue/Green cluster upgrades.
Next Up
In Part 2 – Single or Multi-Cluster Kubernetes?, we’ll compare the key differences between single and multi-cluster setups and detail the many benefits a multi-cluster approach has to offer, along with some of its problems.
Additionally, we take a brief look at what the cloud service providers are currently offering in terms of multi-cluster management. If you have high-performance workloads, this is the post for you.
In Part 3 – Kubernetes Multi-Clusters, we take a look at the different multi-cluster architectures and present a synopsis of several approaches to unifying cluster management, application deployment and Control Plane synchronization that can help you mitigate the service risks discussed here in Part 1.
Tony Barganski is a Principal Consultant at a tech consultancy in London, where he focuses on meeting the cloud computing needs of London’s financial institutions. Tony consistently delivers impactful solutions and has played a pivotal role in driving business advancements for his clients.