Securing your multi-cloud Terraform pipelines with policy-as-code
Learn how to secure your multi-cloud Terraform pipelines with policy as code (PaC) and infrastructure as code (IaC). Read on!
Jun 08, 2023 • 12 Minute Read
- Software Development
For years, many of our organizations have relied on policies, procedures, and work instructions to achieve cloud compliance and maximize assurance. Review committees, change boards, and similar teams have been established to review and approve proposed changes, manage internal governance standards, and audit changes made to the IT infrastructure.
Before the prevalence of automation, scanning, and collaboration tools, the change review process was manual, requiring a potentially significant time investment. One common result was a long delay between when a change was requested (proposed) and when it was implemented.
Although far too common, this approach rarely results in "cloud scale" agility or responsiveness. The result is what I lovingly term "suckurity." Yes, these controls increase our security posture, but the experience largely sucks.
How do you secure your cloud infrastructure?
Today we have a plethora of tools and platforms, ranging from open-source projects to fully supported commercial SaaS applications. We also have a wide variety of infrastructure platforms to manage—public cloud, private cloud, and traditional on-premises infrastructure (or should I say “legacy,” in the context of the cloud world?). Within each of these you have traditional bare-metal systems, virtual machines, containers, serverless, and so on. Each of these "flavors" has a variety of orchestration and management platforms available.
Many organizations have a combination of different technologies, effectively taking a best-of-breed approach, such as:
- Public cloud (multiple providers)
- Private cloud
- Traditional on-premises (utilizing both virtualized and non-virtualized infrastructure, sometimes with multiple hypervisors)
This isn't to say the above is bad, just that there are a lot of disparate, moving pieces. Best-of-breed has a lot of advantages; however, one of the drawbacks is that it can result in a higher level of complexity. Just integrating the disparate platforms so they can effectively talk to one another can be a challenge. Add in all of the operational complexity (patching, vulnerability management, etc.), and it's clear that we require more skillsets than with a more consolidated, standardized approach.
Again, best-of-breed can deliver the most value to an organization, but there is a cost associated with it that must be weighed by the organization. The cost is not simply around licensing, hardware, and vendor maintenance agreements, but also encompasses the operational demands and efficiency.
Any solution to this problem needs to support a wide variety of platforms. It should be extensible and support a virtually unlimited number of integrations. In an effort to avoid reinventing the wheel, let's examine how we used to do things. Maybe we can reuse (or adapt) what worked in yesteryear for today.
Quick caveat: I should mention that I work for Oracle. I lead the DevSecOps efforts on the Developer Relations team. For that reason, I will be using some Oracle Cloud Infrastructure (OCI) terms here. The principles are largely portable and will likely be intuitive (if not familiar) if you've worked with different cloud providers. To put in a blatant plug for OCI, if you've not tried it out, you should. Sign up for an Always Free Oracle Cloud account today!
Going back to our roots…
Let's take another trip down memory lane. There's a point to this, bear with me! Our IT infrastructure used to consist largely of switches, routers, firewalls, servers and storage. Although things were a bit more simplistic (in some regards, anyway), we still had the same concerns as we do today. How do we keep bad/unauthorized users out? How do we let good/authorized users in? Even if they're an authorized user, how do we ensure they can only access and/or change what they need to (should) be changing?
For years we safeguarded our important infrastructure (routers, switches and firewalls) with AAA. AAA stands for Authentication, Authorization, and Accountability. Of course, there are lots of different variations of this as well (different names or terms used), but the fundamental aspects are largely the same, regardless of vendor.
Authentication is largely concerned about validating that you are who you say you are. Authorization ensures that you're only able to do what you've been granted permission to do. Accountability has to do with logging and reviewing the audit logs for what's taken place—the recording (and viewing of) who-did-what-when-and-where records.
In prior years, protocols such as RADIUS and TACACS+ were used to communicate between devices and a management system (the RADIUS or TACACS+ server). Many modern systems (particularly cloud platforms) might not support RADIUS or TACACS+, but there are similarities and concepts that we can take from the past and apply to the present (and future).
| | Traditional infrastructure | Modern cloud |
|---|---|---|
| Authentication | RADIUS/TACACS+, joined to Active Directory or another LDAP directory | IAM, federated to an IdM/IdP |
| Authorization | RADIUS/TACACS+, with configured roles/profiles dictating what can/cannot be done | IAM policies |
| Accountability | RADIUS/TACACS+ and device logs forwarded to a SIEM/SIM | Audit logs |
Wow, looking at the comparison above, it might feel like our job is done. Let's make sure that we're federated and have the right IAM policies in place and we're good-to-go in the modern cloud world. Let's pat ourselves on the back! Not so fast…
What's wrong with IAM?
Nothing is really wrong with IAM. But we need to realize that IAM is not a one-size-fits-all solution. Look at the following scenarios to see where IAM falls short:
| Scenario | Can IAM handle it? |
|---|---|
| User A would like to create a new Subnet in the Virtual Cloud Network (VCN). | Yes |
| User A should only be permitted to create a private Subnet in the VCN. | No |
| User A would like to create a new compute instance in their Subnet. | Yes |
| User A should only use a whitelisted compute image for any compute instances. | No |
The above list is just a really small sampling of what might be possible in a cloud. Although this isn't an exhaustive list, it's easy to see that IAM isn't a perfect solution to our problem.
IAM focuses on the "big blocks" of what you can/cannot do on the cloud platform. It's not concerned with the color of the blocks, or what kind of decorations might be present on the blocks, etc. Looking at the cloud, IAM is typically not concerned with the detailed attributes. IAM is a great fit for configuring different roles for different aspects of the cloud management. For instance, you can easily create a network role, granting that role the ability to manage Subnets, Route Tables, Security Lists and Virtual Cloud Networks (VCNs). IAM will not allow you to control access based on more specific attributes, such as the name of a resource, permitted (or blacklisted) route targets (for a Route Table Rule), permitted CIDRs (configured in Security Lists), etc.
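For instance, in OCI the network role described above could be granted with a single IAM policy statement (the group and compartment names here are illustrative):

```
Allow group NetworkAdmins to manage virtual-network-family in compartment Networking
```

Note how coarse-grained the statement is: it says who may manage networking resources, but nothing about which CIDRs, route targets, or subnet types are acceptable.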
But that's what other cloud security services are for!
True, to an extent. Many cloud providers have security services beyond just simple IAM. OCI is no exception, with Security Zones being just one OCI service that helps mitigate some of the above risks and concerns. If everyone used OCI exclusively (what a great thought!), this article could stop here. But what do we do in a world of multiple cloud providers? Although IAM is largely similar between cloud providers, the additional security services that provide the extra level of detailed authorization we need vary widely from one platform to another. Cloud security services are as wide and varied as the providers that offer them.
We could leverage each cloud provider's security services to help shore up our security posture. This is certainly an ideal end-state. Part of the challenge here is a lack of feature parity from one cloud provider to the next. Depending on the functionality differences, we may or may not be able to guarantee a high level of assurance that each environment will fall within the allotted boundaries we've established. Even in situations where full feature parity is present on all cloud platforms in use, expertise is needed for each different cloud platform to achieve this desired end-state. This means staff, time, and money to get there (assuming that each cloud platform has enough native functionality to get there at all).
We should leverage all of the built-in native security functionality available within OCI (and any other cloud platforms), however, in a multi-cloud environment, this can be a steep hill to climb all at once. This is where taking a crawl-walk-run approach can be beneficial. To jumpstart a stronger security posture in a multi-cloud world, there is…
What is Policy as Code (PaC)?
Policy-as-Code (PaC) is a way to define a set of boundaries in which different roles may work. These boundaries might be expressed as allowlists (only permitting specific attributes/values), blocklists (permitting everything but a specific set of values/attributes), or any combination thereof.
PaC is typically best used with Infrastructure-as-Code (IaC). This is because IaC already gives a definition of how to interact with our infrastructure in a programmatic (usually declarative) fashion. PaC can read that code and assess it for compliance with a given policy, yielding a go/no-go decision as to whether or not it's compliant.
Tight integration with IaC is a good thing, as most of us are managing our environments with code (such as HashiCorp Terraform) already. Suffice it to say that having a blanket "yes" response to the "Should I be using IaC?" question is usually the right answer.
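As a concrete point of reference, here's what a small piece of that Terraform code might look like for OCI (the resource name and variables are illustrative):

```hcl
# Illustrative OCI subnet definition; names and variables are placeholders.
resource "oci_core_subnet" "app" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.main.id
  cidr_block                 = "10.0.1.0/24"
  prohibit_public_ip_on_vnic = true # keeps the subnet private
}
```

Because this definition is declarative, a PaC tool can inspect attributes like `prohibit_public_ip_on_vnic` before anything is actually provisioned.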
So many choices…
There are many PaC solutions available, just as there are many IaC platforms to choose from. One PaC solution worth highlighting is Open Policy Agent (OPA), an open-source CNCF project that's adaptable to many use cases. It's easy to leverage OPA with HashiCorp Terraform, making it of immediate value to many organizations (as Terraform is pretty widely used). It's worth your time to evaluate your options before settling on just one PaC solution for your organization.
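To make this concrete, here's a sketch of an OPA policy (written in Rego) that evaluates the JSON output of `terraform show -json` and flags any OCI subnet that isn't private, closing one of the IAM gaps we identified earlier. The package name and the single rule are illustrative, not a production-ready policy:

```rego
package terraform.subnets

# Deny any planned OCI subnet that allows public IPs on its VNICs.
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "oci_core_subnet"
    not rc.change.after.prohibit_public_ip_on_vnic
    msg := sprintf("%s: subnet must be private", [rc.address])
}
```

An empty `deny` set means the plan is compliant; any entries give the submitter specific, actionable feedback.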
Starting where we're at
It's far too common in organizations to either have no controls in their pipeline or to rely on manual reviews/approvals of submitted code. Looking at an organization that uses HashiCorp Terraform, one of the two scenarios is typical. A simple Terraform pipeline without any controls:
As Terraform code is committed to the Git repo (or more specifically, when a pull/merge request is submitted), it triggers the pipeline to run. Notice how there's really no validation (beyond what each cloud's IAM supports)? The output of terraform plan is captured as part of the pipeline run log, which does give us a little bit more context around what might be changing in the environment. However, this does require manual review and is not a preventive control (more of an auditing control).
The following diagram shows a simple pipeline where changes are manually reviewed:
At least in this scenario there's some sort of manual control, but this typically results in suckurity (slow, poor user experience, and prone to mistakes).
Neither of these options are ideal. Let's look at how this could be improved with PaC.
What it might be with PaC
Here's what we might want to work towards:
The main difference here is the addition of the Compliant? decision point in the pipeline. This is an overly simplified/high-level view of things, but it does allow us to validate the code submitted by the user and ensure that it adheres to our organizational standards. If it's not within our standards, the user will be alerted (the automated pipeline throwing one or more errors) and is given the opportunity to update their code and re-submit. In situations where the submitted code is compliant, the change(s) will proceed unimpeded.
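The decision point itself is simple in principle. Here's a minimal Python sketch of the gate logic, operating on a parsed `terraform show -json` plan; in a real pipeline this evaluation would be delegated to a PaC engine such as OPA, and the single hard-coded rule here is purely illustrative:

```python
def find_violations(plan: dict) -> list[str]:
    """Return policy violations found in a Terraform plan (parsed JSON)."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        # Illustrative rule: OCI subnets must be private.
        if rc.get("type") == "oci_core_subnet" and not after.get("prohibit_public_ip_on_vnic"):
            violations.append(f"{rc.get('address')}: subnet must be private")
    return violations

# A plan that creates a public subnet fails the gate with actionable feedback.
plan = {
    "resource_changes": [
        {
            "address": "oci_core_subnet.app",
            "type": "oci_core_subnet",
            "change": {"after": {"prohibit_public_ip_on_vnic": False}},
        }
    ]
}
print(find_violations(plan))  # a non-empty list means "not compliant"
```

If the list comes back empty, the pipeline proceeds to apply; otherwise it fails fast and the user fixes their code and re-submits.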
A key benefit of this approach is that regardless of whether we're using only OCI or OCI and several other cloud platforms, the security policies are there to protect us. Additionally, setting this up requires very little platform-specific domain expertise. Ideally the security or compliance team (or another designated team) will define and manage what is (and is not) permissible on each cloud platform.
Here's a more detailed sample workflow which uses OPA to make the determination on whether the submitted Terraform (TF) code is acceptable or not:
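In CI terms, that workflow boils down to a validation stage similar to the following. The YAML structure and job name are illustrative rather than tied to any specific CI product, though the `terraform` and `opa` commands themselves are real:

```yaml
# Illustrative CI job; structure and names are placeholders.
validate:
  script:
    - terraform init
    - terraform plan -out=tfplan
    - terraform show -json tfplan > tfplan.json
    # Exit non-zero (failing the pipeline) if any deny rule matches:
    - opa eval --fail-defined --data policies/ --input tfplan.json "data.terraform.deny[x]"
```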
Is this the be-all and end-all of pipeline security? No, not at all.
Taking this approach, a "gate" (control) is present, where a go/no-go decision can be made prior to any changes being made to the environment. This helps to ensure that the proposed infrastructure changes are compliant with organizational standards (as defined in the OPA policies). Users get instant feedback and are able to move quickly, while maintaining a high level of assurance for the environment.
Why did we do this again?
We had a couple of trips down memory lane and went down several rabbit holes, but the outcome is that PaC is an ideal way to consistently manage multi-cloud environments (when using HashiCorp Terraform or a similar cross-platform IaC solution). PaC provides a high level of assurance that our corporate policies will be followed. This can be done without sacrificing speed of execution or requiring a significant ongoing time investment (once the PaC policies are created, they can be used to repeatedly validate code submissions).
With safeguards in place, we can safely permit teams to manage their own infrastructure, knowing that PaC will help us remain compliant. By using an integrated, automated pipeline, users are given nearly instantaneous results, which increases the agility and responsiveness of the team (and organization). We've enabled our organization to do what they need to do without compromising our security posture. We've successfully moved from suckurity to security. That's a win-win for everyone!