Intro
In this blog, we will discuss the intersection of Site Reliability Engineering (SRE) and Kubernetes security. Kubernetes SREs are emerging as the guardians of scalability, reliability, and even Kubernetes disaster recovery, and while their primary role is to ensure the resilience of critical applications, SREs find themselves with many common service level objectives that can best be achieved through Kubernetes security best practices.
What is the role of SRE in Kubernetes?
When it comes to Kubernetes, Site Reliability Engineers, aka SREs, are quickly becoming the resident experts for managing and building a scalable Kubernetes environment. The core of the SRE role is to ensure the resiliency of critical applications. Key responsibilities include implementing observability and automation best practices, monitoring and reporting on performance, documenting best practices and even helping with incident response procedures and processes. Most SRE roles require familiarity with DevOps and, you guessed it, Kubernetes. The Kubernetes SRE role is usually the function that standardizes the way teams interact with and build on Kubernetes as a central environment.
What is the role of the SRE in Kubernetes security?
It is not always obvious why SREs, whose vested interest is a scalable Kubernetes environment, should also be concerned with securing it. But because the role of an SRE for Kubernetes means keeping Kubernetes up and running, in practice the SRE mandate maps directly onto the 'availability' goal of the Confidentiality, Integrity, and Availability (CIA) model so familiar to security practitioners.
While certain classes of attacks can result in downtime (Denial of Service attacks being the obvious example), most often a cyber adversary is not interested in simply crashing your system; they would rather steal data or gain control. Most outages are actually the result of "accidents": mistakes made by developers or engineers that cause a system to fail, whether on your network or on a network you depend on.
Suffice it to say, it is ultimately in the SRE’s best interest to have a stake in the overall security of the Kubernetes environment.
When a Kubernetes Outage Could be a Security Event
Kubernetes was designed for internet-scale enterprise operations (literally, Google). One of the key design considerations of anything “internet scale” is that it needs to rely heavily on protocols which can operate without a lot of direct intervention. For example, internet routing tables largely build and update themselves without the need for manual administration. Similarly, Kubernetes has several “protocols” of its own that result in workloads being distributed, scaled, and restored without anyone needing to type any additional commands. When this is working as intended, it should make an SRE’s life easier. Of course, nothing always works as intended.
For example, imagine a bug in an image causes workloads to consistently crash under predictable circumstances. Somehow this bug escaped notice before production deployment, so now workloads are consistently crashing in production. Kubernetes wants to maintain these workloads, so after every crash it tries to relaunch them. That's what Kubernetes is designed to do, and normally that automatic recovery protocol is an SRE’s friend. But given enough repeatedly failing deployments, this same mechanism could trigger a Denial of Service incident, consuming enough resources across the cluster so as to prevent other services from successfully launching or causing the cluster itself to start experiencing failures.
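A common guardrail against exactly this failure mode is to set resource requests and limits on every workload, so a crash-looping container cannot starve the rest of the cluster while Kubernetes backs off its restarts (the familiar CrashLoopBackOff state). The fragment below is a minimal sketch; the Deployment name, image, and values are illustrative, not a recommendation for your workloads.

```yaml
# Hypothetical Deployment fragment: resource limits bound the blast
# radius of a crash-looping container so it cannot exhaust its node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:1.2.3   # placeholder image
        resources:
          requests:                # what the scheduler reserves
            cpu: "100m"
            memory: "128Mi"
          limits:                  # hard ceiling enforced at runtime
            cpu: "500m"
            memory: "256Mi"
```

Namespace-level ResourceQuota and LimitRange objects can enforce the same ceiling for teams that forget to set limits themselves.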
While such a bug might be accidental (a coding mistake requiring a patch and better testing going forward), the same scenario could also be the result of a supply chain attack (a bad base image used throughout your environment) or could be triggered by an attacker exploiting a known vulnerability to repeatedly trigger the crash conditions and cause a cascading failure. In either case, SREs will want to know something about how Kubernetes security works in order to diagnose and (likely in collaboration with their security partners) remediate the issue. For example, in addition to patching the bug, a known bad base image may warrant an update to the cluster's admission control rules to prevent that image from being deployed again. And your uptime may benefit from firewall rules protecting against exploitation of a known vulnerability.
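One way to express such an admission rule is Kubernetes' built-in ValidatingAdmissionPolicy (alpha in 1.26, GA in 1.30); policy engines like Kyverno or OPA Gatekeeper offer equivalents. The sketch below rejects Pods built from a known-bad image prefix; the registry path is a placeholder, and a real policy would also need a matching ValidatingAdmissionPolicyBinding.

```yaml
# Sketch: reject any Pod whose containers use a known-bad base image.
# The image prefix below is a placeholder for your actual bad image.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-known-bad-image
spec:
  failurePolicy: Fail              # block on match rather than warn
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: "object.spec.containers.all(c, !c.image.startsWith('registry.example.com/bad-base'))"
    message: "Image is built on a known-bad base and may not be deployed."
```

Because the policy is itself a Kubernetes object, it can live in version control and be rolled out through the same pipelines as the rest of your configuration.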
Kubernetes Disaster Recovery
The theoretical case of a Denial of Service incident like we just described is, as we’ve mentioned, an exception, not the rule. However, there are other circumstances under which “security” and “SRE” regularly meet, often as part of disaster recovery in Kubernetes after either an outage or a security event.
In general, Kubernetes backup and recovery often require elevated permissions. Systems need to be updated, restarted, or re-deployed with new configurations. In a Kubernetes environment, these permissions should be restricted with a combination of RBAC policies and infrastructure access controls (for example, AWS IAM roles and policies). Those policies should be designed to enforce the Principle of Least Privilege, a cornerstone of security practice. During an outage, however, these restricted privileges can make an SRE's life a lot harder.
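To make the tension concrete, here is a hypothetical least-privilege RBAC sketch: an SRE on-call group can inspect and patch Deployments in one namespace, but cannot read Secrets or modify RBAC itself. The namespace, group name, and verbs are illustrative assumptions.

```yaml
# Hypothetical least-privilege Role for an SRE on-call group.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: payments              # illustrative namespace
  name: sre-oncall
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "patch"]   # enough to restart/roll
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]            # read-only diagnostics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: payments
  name: sre-oncall-binding
subjects:
- kind: Group
  name: sre-oncall                 # assumes group mapping via your IdP
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: sre-oncall
  apiGroup: rbac.authorization.k8s.io
```

Note what is missing: no access to Secrets, ClusterRoles, or cluster-scoped resources, which is precisely what can slow an SRE down mid-outage.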
There are two paths forward for a Kubernetes disaster recovery strategy in the face of these restrictions. Option one is to invoke "emergency" protocols and use a "break-glass" account with privileged access to the cluster. This represents an intentional escalation of privileges and should only be invoked to address critical and urgent circumstances. It should also be done with your security team's awareness and backing (ideally, they should have "alarmed" any use of break-glass accounts). Option two, and the better path, is to mature your DevSecOps workflow so that changes can be made via GitOps instead of through manual intervention with elevated permissions. This not only preserves the RBAC permissions already in place, it also provides (via the git history) a clear record of what was changed, when, and by whom. That makes retrospective analysis and future planning much clearer and ensures that changes are reproducible rather than dependent on the memory of whoever implemented them in a moment of crisis.
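As a sketch of what the GitOps path can look like, the hypothetical Argo CD Application below (one popular GitOps controller; Flux works similarly) continuously reconciles a cluster against a git repository, so recovery changes land as reviewed commits rather than ad hoc kubectl commands. Repository URL, paths, and names are placeholders.

```yaml
# Hypothetical Argo CD Application: cluster state is reconciled from
# git, so every recovery change is recorded and reproducible.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config.git  # placeholder
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  syncPolicy:
    automated:
      prune: true                  # remove resources deleted from git
      selfHeal: true               # revert manual drift automatically
```

With `selfHeal` enabled, out-of-band manual changes are reverted, which is exactly the drift-prevention property described above.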
A second Kubernetes recovery strategy in which the concerns of security and SRE intersect is recovery from a (non Denial-of-Service) security event. If there has been an intrusion into your systems, your security team will undoubtedly want to isolate the affected systems from the network and perform at least some forensic data collection on those systems. This will result in those systems being offline for a certain period of time, potentially creating an outage or putting stress on the overall system's performance. This is particularly acute in a Kubernetes environment, where individual nodes may host many different applications or tenants, so the effects can be wide-ranging depending on the size of the cluster and the depth of the incident. If the incident involves a control plane compromise, it may require isolating an entire cluster. As a result, SREs may need to deploy new nodes to absorb redeployed workloads as a kind of Kubernetes cluster backup, or even stand up new clusters with a full roll-over strategy to maintain system availability while the incident is contained and eradicated.
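One isolation technique that keeps forensic evidence intact is a deny-all NetworkPolicy applied to quarantined pods: their network access is severed, but they remain scheduled so investigators can inspect them. This is a hedged sketch; the label and namespace are illustrative, and enforcement requires a CNI plugin that supports NetworkPolicy.

```yaml
# Sketch: quarantine pods labeled for forensics by denying all
# ingress and egress traffic while leaving the pods running.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: payments              # illustrative namespace
spec:
  podSelector:
    matchLabels:
      incident/quarantine: "true"  # applied by responders at triage
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules are listed, so all traffic to and from
  # matching pods is denied.
```

Responders can then label a suspect pod to cut it off, while SREs scale replacement workloads elsewhere to absorb the lost capacity.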
As security incidents like this are often evolving and murky, the number of affected systems and the time they need to be offline may both be something of a moving target. SREs need to be involved in these recovery efforts to help limit the impact to the broader organization while effectively isolating and eradicating the threat. This requires a strong partnership between SRE teams and security, with both teams understanding one another's processes and requirements, including Kubernetes backup and recovery.
Good security architecture correlates with good resiliency architecture
Ultimately, the goal of SRE is to build architectures that are resilient so that outages are rare and quickly fixable. It turns out that there is a fair amount of overlap between architecting for resilience and architecting for security.
For example, we have written before about the importance of configuring and deploying your Kubernetes cluster using Infrastructure as Code (IaC) and Configuration as Code (CaC). IaC/CaC provides a mechanism for ensuring security standards are applied in a repeatable and predictable way across all iterations of a platform or application deployment. They can also be scanned for misconfigurations, allowing you to “shift security left”, and used as a baseline to prevent configuration drift. Many of these same benefits are important to resilience. Predictable, repeatable deployment patterns ensure that steps taken to improve performance and resilience cannot be forgotten. And preventing configuration drift reduces the likelihood of accidents brought on by manual changes causing an outage on a critical system.
Other examples abound. One of the core tenets of SRE is to architect systems in a "clean" and modular way that lets each component focus on its one job (and ideally be scaled according to the requirements of that job). Security architectures can also benefit from this paradigm. For example, implementing Kubernetes authentication that uses an enterprise IdP and/or a Zero Trust architecture follows this approach. Another core tenet is improved visibility into running systems to (ideally) identify errors and problems before they turn into an outage. This overlaps with a security team's own interest in collecting and monitoring logs to identify potential security events in the cluster. And indeed many other Kubernetes security best practices (using RBAC, limiting container privileges, storing secrets in a vault, defining allowed east-west connections, using defined service accounts and namespaces instead of defaults, etc.) have substantial benefits both for security and for the resilience of the system's architecture. These provide opportunities for synergy between security and SRE teams that can improve outcomes for both.
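Several of those best practices meet in the Pod spec itself. The hedged sketch below shows a hardened container: a dedicated service account, no auto-mounted API token, and a securityContext that drops privileges. Names and image are placeholders, and the exact settings should be tuned per workload.

```yaml
# Hypothetical hardened Pod: the same settings that limit an
# attacker's reach also limit the damage a misbehaving container
# can do to its node.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example           # illustrative name
spec:
  serviceAccountName: app-svc      # dedicated account, not "default"
  automountServiceAccountToken: false
  containers:
  - name: app
    image: registry.example.com/app:1.2.3   # placeholder image
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]              # shed all Linux capabilities
```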
Summary of security benefits for SRE in Kubernetes
- Incident investigation: determine whether the Kubernetes outage is caused by a supply chain attack or benign image bug
- Disaster recovery: dealing with restrictions via repeatable GitOps processes in the DevSecOps workflow versus implementing insecure “break-glass” privileges, resulting in more streamlined issue analysis and easier future planning
- Quicker MTTD and MTTR: keep downtime from any incident, and the amount of planning required to work around it, to a minimum
- Consistency: repeatable and predictable deployment patterns are good for security as well as resiliency
- Modular approach to development: a Zero Trust security design supports the clean and modular approach to system architecture that SREs strive to create
- Proactive approach: improved Kubernetes security visibility also helps SRE teams be proactive in identifying issues before they turn into downtime or latency
How KSOC can help secure Kubernetes for SRE teams
There are many aspects of the KSOC platform that SREs and engineers alike find useful. Below are just some examples:
Above, we noted that one way to prevent disruption from a known bad base image is to update the admission control rules across clusters to prevent that image from being deployed again. This is easy with KSOC, and policies can even be enforced straight from the code in the cluster.
To quickly diagnose whether a performance issue is security-related, KSOC shows the workloads and resources attached to any particular threat vector, making for an easy check. Below is an example of a list of the workloads connected to a threat vector where a service account with excessive permissions is mounted in an exposed workload:
Based on the timestamps above, you can see that risks are detected in real time. The historical data around threat vectors is maintained so you can always look back and see everything that happened, lowering Mean Time to Detection (MTTD) and Mean Time to Response (MTTR), which can limit any potential offline time.
With KSOC, you can also change your policy straight from GitOps, avoiding any manual intervention with elevated permissions and preserving RBAC policies.
Improved Kubernetes security visibility into your running workloads, in real time, is one of the top use cases of KSOC for both security and SRE teams, allowing SRE teams to be proactive about any issues that could cause downtime down the line, especially as the environment scales.
Conclusion
Security has always had at least a theoretical interest in the “availability” of services, but most downtime events are the result of mistakes or misconfigurations, not attacks. This has led to the rise of the SRE discipline focused on resilient system architectures. There are still times when an outage could be either the result of or the side effect of a security event (or the recovery from one), and SREs should be aware of these kinds of circumstances. But even apart from these events, many Kubernetes security best practices are also best practices for a resilient system architecture, giving SREs and security teams an opportunity to be partners and collaborators in improving the overall posture of their environments.