
Part 5. How to Set Up Continuous Integration (CI) and Continuous Delivery (CD)

Yevgeniy Brikman

JUN 25, 2024

Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing production software, published by O’Reilly Media!

This is Part 5 of the Fundamentals of DevOps and Software Delivery series. In Part 4, you learned several key tools and techniques that help developers work together, including version control, build systems, and automated tests. But merely having a collection of tools and techniques is not enough: you also need to know how to put them together into an effective software delivery lifecycle (SDLC). As a reminder, the SDLC needs to solve the following problems:

Code access

All the developers on your team need a way to access the same code so they can all collaborate on it.

Integration

As multiple developers make changes to the same code base, you need some way to integrate their changes, handling any conflicts that arise, and ensuring that no one’s work is accidentally lost or overwritten.

Correctness

It’s hard enough to make your own code work, but when multiple people are modifying that code at the same time, you need to find a way to prevent constant bugs and breakages from slipping in.

Release

Getting a codebase working is great, but as you may remember from the Preface, code isn’t done until it’s generating value for your users and your company, which means you need a way to release the changes in your codebase on a periodic basis.

Now that you have the ingredients for solving these problems, in this blog post, you’ll learn how to put these ingredients together. To do this, you will first learn about continuous integration (CI) and then continuous delivery (CD). The combination of the two, CI/CD, is a central part of the SDLC of all the companies you read about in Section 1.1 that have world-class software delivery practices.

Let’s start with CI.

Continuous Integration (CI)

What is continuous integration? It may be easier to understand it by comparing it with its opposite, late integration.

Imagine you’re responsible for building the International Space Station (ISS), which consists of dozens of components, as shown in Figure 41.

Figure 41. International Space Station [27]

Each component will be assigned to a team from a different country, and it’s up to you to decide how you will organize these teams. You have two options:

Option 1: late integration

Come up with a design for all the components up front and then have each team go off and work on their component in complete isolation until it’s finished. When all the teams are done, launch all the components into outer space, and try to put them together at the same time.

Option 2: continuous integration

Come up with an initial design for all the components and then have each team go off and start working. As they make progress, they regularly test each component with all the other components and update the design if there are any problems. As components are completed, you launch them one at a time into outer space, and assemble them incrementally.

How do you think option #1 is going to work out? In all likelihood, attempting to assemble the entire ISS at the last minute will expose a vast number of conflicts and design problems. Team A thought team B would handle the wiring while team B thought team A would do it; all the teams used the metric system, except one; no one remembered to install a toilet. Unfortunately, as everything has been fully built and is already floating in outer space, it will be expensive and difficult to go back and fix things.

Option 1 may sound ridiculous, but this is exactly the way in which many companies build software. Developers work in totally isolated feature branches in their version control system for weeks or months at a time and then, at the very last minute, when a release rolls around, they try to merge all the feature branches together. This process is known as late integration, and it often leads to disaster, as shown in Figure 42.

Figure 42. The huge merge conflicts that you get as a result of late integration

When you don’t merge your code together for long periods of time, you end up with horrible merge conflicts: two teams modified the same file, but in incompatible ways; one team has made changes in a file that another team deleted entirely; one team did a giant refactor to remove all usages of a deprecated service, but the other teams have introduced dozens of new usages of that service; and so on. All of these conflicts lead to bugs and problems that take days or weeks to stabilize, turning the release process into a long, drawn-out nightmare.

A better approach, as described in option #2, is continuous integration (CI), which is a software development practice where every developer on your team merges their work together on a very regular basis: typically daily or multiple times per day. The key benefit of CI is that it exposes problems with your work earlier in the process, before you’ve gone too far in the wrong direction, and allows you to make improvements incrementally.

Key takeaway #1

Ensure all developers merge all their work together on a regular basis: typically daily or multiple times per day.

The most common way to implement continuous integration is to use a trunk-based development model, where developers do all of their work on the same branch, typically main or master or trunk, depending on what your VCS calls it; I’ll mostly refer to this branch as main in this blog post series. With trunk-based development, you no longer have long-lived feature branches. Instead, you create short-lived branches that typically last from a few hours to a few days, and you open pull requests to get your branch merged back into main on a regular basis.

It may seem like having all developers work on a single branch couldn’t possibly scale, but the reality is that it might be the only way to scale. LinkedIn moved off of feature branches and onto trunk-based development as part of Project Inversion, which was essential for scaling the company from roughly 100 developers to over 1,000. Facebook uses trunk-based development for thousands of developers. Google uses trunk-based development for tens of thousands of developers, and has shown it can scale to 2+ billion lines of code, 86TB of source data, and around 40,000 commits per day.

If you’ve never used trunk-based development, it can be hard to imagine how it works. The same questions come up again and again:

  • Wouldn’t you have merge conflicts all the time?

  • Wouldn’t the build always be broken?

  • How do you make large changes that take weeks or months?

In the next three sections, I’ll address each of these questions, showing the tools and techniques companies use to deal with them, and then walk you through an example of how to set up some of these continuous integration tools and techniques yourself.

Dealing with Merge Conflicts

The first question that newbies to trunk-based development often ask is, won’t you be dealing with merge conflicts all the time? After all, with feature branches, each time you merge, you get days or weeks of conflicts to resolve, but at least you only have to deal with that once every few weeks or months. Whereas with trunk-based development, wouldn’t you have to fight with merge conflicts many times per day?

As it turns out, the reason that feature branches lead to painful merge conflicts is precisely because those feature branches are long-lived. If your branches are short-lived, the odds of merge conflicts are much lower. For example, imagine you have a repo with 10,000 files, and two developers working on changes in different branches. After one day, perhaps each developer has changed 10 files; if they try to merge the branches back together, the chances that some of those 20 files overlap, out of 10,000, are pretty low. But if those developers worked in those branches for three months, and changed hundreds of files in each branch during that time, then the chances that some of those files overlap and conflict are much higher.
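To put rough numbers on this intuition, here is a minimal sketch. It assumes each developer changes files picked uniformly at random, which real changes are not (they tend to cluster in the same hot spots), so treat the results as illustrative only. The function name overlapProbability and the figure of 300 files for the three-month branches are assumptions standing in for the "hundreds of files" above.

```javascript
// Probability that two developers, each changing k distinct files picked
// uniformly at random from a repo of n files, touch at least one file in common.
// Assumption: uniform random picks; real-world changes are not uniform.
function overlapProbability(n, k) {
  let pNoOverlap = 1;
  // The second developer picks k files one at a time; each pick must avoid
  // the k files that the first developer changed.
  for (let i = 0; i < k; i++) {
    pNoOverlap *= (n - k - i) / (n - i);
  }
  return 1 - pNoOverlap;
}

// One day of work: 10 files each out of 10,000
console.log(overlapProbability(10000, 10).toFixed(4));  // roughly 0.01, i.e., about 1%

// Three months of work: an assumed 300 files each out of 10,000
console.log(overlapProbability(10000, 300).toFixed(4)); // very close to 1
```

Under these simplified assumptions, the day-old branches almost never touch the same file, while the three-month-old branches almost always do.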

Moreover, even if there are merge conflicts, it’s much easier to deal with them if you merge regularly. If you’re merging two branches that are just a day old, the conflicts will be relatively small, as you can’t change all that much code in a single day, and the code will still be top-of-mind, as you worked on it within the last 24 hours. On the other hand, if you’re merging code that is several months old, then the conflicts will be larger, as you can make a lot of changes in a few months, and you’re less likely to remember what the changes are about, as you may have worked on them months ago.

The most important thing to understand is this: when you have multiple developers working on a single codebase at the same time, merge conflicts are unavoidable, so the question isn’t how to avoid merge conflicts, but how to make those merge conflicts as painless to deal with as possible. And that’s one of many places in software delivery where Martin Fowler’s quote applies:

If it hurts, do it more often.

— Martin Fowler
Frequency Reduces Difficulty

Merge conflicts hurt. The way to make it hurt less, oddly enough, is to merge more often.

Preventing Breakages with Self-Testing Builds

The second question that newbies to trunk-based development often ask is, won’t you be dealing with breakages all the time? After all, with feature branches, each time you merge, it can take days or weeks to fix all the issues that come up and stabilize the release branch, but at least you only have to deal with that once every few weeks or months. Whereas with trunk-based development, wouldn’t you have to fight with breakages many times per day?

Have no fear: this is precisely where the automated testing practices you learned about in Part 4 come to the rescue. Companies that practice CI and trunk-based development configure a self-testing build that runs automated tests after every commit. This includes commits on any branch, so every time a developer opens a pull request to merge a branch into main, you automatically run tests against their branch, and show the test results directly in the pull request UI (you’ll see an example of how to set this up a little later in this post). That way, code that doesn’t pass your test suite doesn’t get merged to main in the first place. And if somehow some code does slip through that breaks main, then as soon as you detect it, the typical solution is to revert that commit automatically. This way, you get main back into working condition quickly, and the developer who merged in the broken code can redo their commit later, once they’ve fixed whatever caused the breakage.
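The policy described above can be sketched as a small decision function. This is only an illustration of the logic, not how any particular CI server implements it; the function name decideAction and the action names it returns are hypothetical.

```javascript
// Hypothetical decision logic for a self-testing build.
// `branch` is the branch the commit landed on; `testsPassed` is the result
// of running the automated test suite against that commit.
function decideAction({ branch, testsPassed }) {
  if (branch === 'main') {
    // Broken code slipped into main: revert it so main is working again.
    return testsPassed ? 'no-action' : 'revert-commit';
  }
  // Any other branch: the pull request may only merge if the tests pass.
  return testsPassed ? 'allow-merge' : 'block-merge';
}

console.log(decideAction({ branch: 'test-workflow', testsPassed: false })); // block-merge
console.log(decideAction({ branch: 'main', testsPassed: false }));          // revert-commit
```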

The most common way to set up a self-testing build is to run a CI server, which is a piece of software that integrates with your version control system to run various automations, such as your automated tests, in response to new commits, branches, and so on. There are many CI servers out there, including some solutions that you run yourself, such as Jenkins, TeamCity, Drone, and Argo, and some solutions that are managed services, such as GitHub Actions, CircleCI, and GitLab.

CI servers are such an integral part of continuous integration that, for many developers, the two terms are nearly synonymous. This is because a CI server and a good suite of automated tests completely change how you deliver software:

Without continuous integration, your software is broken until somebody proves it works, usually during a testing or integration stage. With continuous integration, your software is proven to work (assuming a sufficiently comprehensive set of automated tests) with every new change—and you know the moment it breaks and can fix it immediately.

— Jez Humble and David Farley
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Professional).

Going from a default of broken to a default of working is a profound transformation. Instead of a multi-day merge process to prepare your code for release, your code is always in a releasable state—which means you can deploy whenever you want. To some extent, the role of a CI server is to act as a gatekeeper, protecting your code from any changes that jeopardize your ability to deploy at any time.

Key takeaway #2

Use a self-testing build after every commit to ensure your code is always in a working and deployable state.

In Section 1.1, you saw that companies with world-class software delivery processes are able to deploy thousands of times per day. Continuous integration—including a CI server and thorough automated test suite—is one of the key ingredients that makes this possible; you’ll see some of the other ingredients throughout this post.

Making Large Changes

The third question that newbies to trunk-based development often ask is, how do you handle changes that take a long time to implement? CI sounds great for small changes, but if you’re working on something that will take weeks or months—e.g., major new features or refactors—how can you merge your incomplete work on a daily basis without breaking the build or accidentally releasing unfinished features to users?

There are two approaches that you can use to resolve this: branch by abstraction and feature toggles. These two techniques are the focus of the next two sections.

Key takeaway #3

Use branch by abstraction and feature toggles to make large-scale changes while still merging your work on a regular basis.

Branch by abstraction

Branch by abstraction is a technique that allows you to make large-scale changes to your code incrementally, across many commits, without ever risking breaking the build or releasing unfinished work to users. For example, let’s say you have hundreds of modules in your codebase that use Library X, as shown in Figure 43:

Figure 43. Library X is used across many modules in your codebase

You want to replace Library X with Library Y, but this will require updating hundreds of modules, which could take months. If you do this work in a feature branch, by the time you merge it back, there’s a good chance you’ll have merge conflicts with many of the updated modules, and it’s possible new usages will have shown up in the meantime, so you’d have even more work to do.

Instead of a feature branch, the idea with branch by abstraction is to keep working on main, but to introduce a new abstraction into the codebase. What type of abstraction you use depends on your programming language: it might be an interface, a protocol, a class, etc. The important thing is that (a) the abstraction initially uses Library X under the hood, so there is no change in behavior and (b) it creates a layer of indirection between your modules and Library X, as shown in Figure 44:

Figure 44. Introduce an abstraction between your modules and Library X

You can update your modules to use the abstraction incrementally, across many commits to main. There’s no hurry or risk of breakage, as under the hood, the abstraction is still using Library X. After some time, all modules should be using the abstraction, and you could even add an automated test that fails if anyone tries to use Library X directly.
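As a rough idea of what such an automated check might look like, here is a minimal sketch that scans source files for direct imports of Library X. Everything in it is an assumption made for illustration: the module name library-x, the file paths, and the function name findDirectUsages are all hypothetical, and a real check would read files from disk rather than from an in-memory object.

```javascript
// Hypothetical guard: list the files that import a banned module directly
// instead of going through the abstraction.
// `sources` maps file paths to file contents; `allowedFiles` lists the files
// (typically just the abstraction itself) that may still import the module.
function findDirectUsages(sources, { bannedModule, allowedFiles }) {
  // Escape regex metacharacters in the module name
  const escaped = bannedModule.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Match both require('library-x') and import ... from 'library-x'
  const pattern = new RegExp(`(require\\(|from\\s+)['"]${escaped}['"]`);
  return Object.entries(sources)
    .filter(([path]) => !allowedFiles.includes(path))
    .filter(([, contents]) => pattern.test(contents))
    .map(([path]) => path)
    .sort();
}

const sources = {
  'src/storage.js': "const x = require('library-x');",     // the abstraction
  'src/orders.js': "const storage = require('./storage');", // uses the abstraction
  'src/reports.js': "import x from 'library-x';",           // direct usage
};

const offenders = findDirectUsages(sources, {
  bannedModule: 'library-x',
  allowedFiles: ['src/storage.js'],
});
console.log(offenders); // only src/reports.js is flagged
```

An automated test could then fail whenever the returned list is non-empty.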

At this point, you can start replacing Library X with Library Y, as shown in Figure 45:

Figure 45. Start to incrementally update your modules to use Library Y

Again, you can roll out this change incrementally, across many commits to main, integrating your work regularly to minimize merge conflicts. You could also update your abstraction code to ensure that any new usages of the abstraction get Library Y under the hood by default. Eventually, when you’re done updating all module usages, you can remove Library X entirely, as shown in Figure 46:

Figure 46. After you’ve updated all module usages, you can remove Library X
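To make the steps above more concrete, here is a minimal sketch of branch by abstraction in JavaScript. The two libraries are hypothetical stand-ins defined inline, and the names createRecordStore, getRecord, and useLibraryY are assumptions chosen for this example; in a real codebase, the abstraction would wrap whatever APIs Library X and Library Y actually expose.

```javascript
// Hypothetical stand-ins for the two libraries, with deliberately different APIs.
const libraryX = { fetchRecord: (id) => ({ id, loadedBy: 'X' }) };
const libraryY = { get: ({ key }) => ({ id: key, loadedBy: 'Y' }) };

// The abstraction: the only place in the codebase that knows about X or Y.
// It delegates to Library X by default, so introducing it changes no behavior.
function createRecordStore({ useLibraryY = false } = {}) {
  return {
    getRecord(id) {
      return useLibraryY ? libraryY.get({ key: id }) : libraryX.fetchRecord(id);
    },
  };
}

// Modules depend only on the abstraction, never on a library directly.
function describeRecord(store, id) {
  const record = store.getRecord(id);
  return `record ${record.id} loaded by Library ${record.loadedBy}`;
}

// Before the migration: the default still uses Library X
console.log(describeRecord(createRecordStore(), 42));
// During the migration: individual modules opt in to Library Y
console.log(describeRecord(createRecordStore({ useLibraryY: true }), 42));
```

Once every caller passes useLibraryY (or the default is flipped), the Library X branch and the option itself can be deleted in a final small commit.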

Branch by abstraction is a great technique for doing large-scale refactors. But what if you need to introduce totally new functionality? If that functionality takes weeks or months to implement, how can you merge it regularly into main without accidentally releasing unfinished features to users? This is when you turn to the second approach, feature toggles, as described next.

Feature toggles

The idea with feature toggles (AKA feature flags) is to wrap new functionality in conditionals that let you turn (toggle) those features on and off dynamically. For example, imagine that you wanted to take the Node.js sample app you’ve been using throughout this blog post series, and to update it to return a proper home page that is a little more interesting than the "Hello, World!" text. However, it’s going to take you several months to implement this new home page. The idea with a feature toggle is to add a conditional to your code as shown in Example 89 (you don’t need to actually make these code changes; this is just for demonstration purposes):

Example 89. An example of using a feature toggle to pick between the new home page and the original "Hello, World!" text (ch5/sample-app/app.js)
app.get('/', (req, res) => {
  if (lookupFeatureToggle(req, "HOME_PAGE_FLAVOR") === "v2") { // (1)
    res.send(newFancyHomepage());                              // (2)
  } else {
    res.send('Hello, World!');                                 // (3)
  }
});

Here’s what this code does:

(1) Use the lookupFeatureToggle function to look up the value of the "HOME_PAGE_FLAVOR" feature toggle.
(2) If the value of the feature toggle is "v2", send back the contents of the new home page as a response.
(3) If the value of the feature toggle is anything else, send back the original "Hello, World!" text.

So what does the lookupFeatureToggle function do? Typically, this function will check if the feature toggle is enabled by querying a dedicated feature toggle service, which is a service that can do the following:

Store a feature toggle mapping

The mapping is from a feature toggle name (e.g., HOME_PAGE_FLAVOR) to its value (e.g., true, false, or an arbitrary string like "v2").

Look up feature toggles programmatically

You provide an API or SDK your apps can use to look up the current value of a feature toggle (e.g., the lookupFeatureToggle function would use this SDK under the hood).

Update feature toggles without having to change code

You have some sort of web UI, API, or other mechanism that lets you quickly change the value of a feature toggle at any time—without having to update or deploy new code.
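The three capabilities above can be sketched with a tiny in-memory store. This is only a toy to show the shape of the idea, not how any real feature toggle service works: createFeatureToggleService and its set and lookup methods are hypothetical names, and a real service would persist the mapping and expose it over a network API or SDK.

```javascript
// Hypothetical in-memory feature toggle service.
function createFeatureToggleService() {
  // Store the mapping from feature toggle name to value
  const toggles = new Map();
  return {
    // Update a toggle at runtime, with no code change or deployment
    set(name, value) {
      toggles.set(name, value);
    },
    // Look up the current value; a toggle that was never set is off (false)
    lookup(name) {
      return toggles.has(name) ? toggles.get(name) : false;
    },
  };
}

const toggleService = createFeatureToggleService();

// A lookupFeatureToggle function like the one used in the sample app could
// delegate to the service. The req parameter is unused in this sketch.
function lookupFeatureToggle(req, name) {
  return toggleService.lookup(name);
}

console.log(lookupFeatureToggle({}, 'HOME_PAGE_FLAVOR')); // false (never set, so off)
toggleService.set('HOME_PAGE_FLAVOR', 'v2');
console.log(lookupFeatureToggle({}, 'HOME_PAGE_FLAVOR')); // v2
```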

You could build your own feature toggle service around a database, or deploy an open source feature toggle service such as GrowthBook, Flagsmith, Flagr, or OpenFeature, or you could use a managed feature toggle service such as Split, LaunchDarkly, ConfigCat, or Statsig.

It might not be obvious, but the humble if-statement, combined with a feature toggle check, unlocks a superpower: you can now commit and regularly merge code, even before it’s done. This is because of the following key property of feature toggles:

The default value for all feature toggles is off.

If you wrap new features in a feature toggle check, as long as the code is syntactically valid (which you can validate with simple automated tests), you can merge your new feature into main long before that feature is done, as by default, the new feature is off, so it will have no impact on other developers or your users. This is what allows you to develop large new features while still practicing continuous integration.
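One way to gain confidence in this property is to exercise both code paths in your automated tests. The sketch below pulls the home page decision into a plain function so it can be checked without a web server; renderHomePage and the placeholder body of newFancyHomepage are assumptions made for illustration, not code from the sample app.

```javascript
// Placeholder for the unfinished feature; it only needs to be valid code.
function newFancyHomepage() {
  return '<h1>New home page (work in progress)</h1>';
}

// Pick the response based on the toggle value. Anything other than "v2",
// including an unset toggle (false or undefined), keeps the original behavior.
function renderHomePage(toggleValue) {
  if (toggleValue === 'v2') {
    return newFancyHomepage();
  }
  return 'Hello, World!';
}

console.log(renderHomePage(false));     // Hello, World!
console.log(renderHomePage(undefined)); // Hello, World!
console.log(renderHomePage('v2'));      // the work-in-progress page
```

With checks like these in the test suite, the unfinished code is exercised on every commit, while users continue to get the original response until someone turns the toggle on.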

What’s even more surprising is that this is only one of the superpowers you get with feature toggles; you’ll see a number of others later in this blog post, in the continuous delivery section.

Example: Run Automated Tests for Apps in GitHub Actions

Example Code

As a reminder, you can find all the code examples in the blog post series’s sample code repo in GitHub.

Now that you understand the basics of continuous integration, let’s get a little practice setting up some of the technology that enables it: namely, a self-testing build. You added some automated tests in Section 4.3, so the goal is to run these tests automatically after each commit, and to show the results in pull requests. In Section 4.1.3, you pushed your code to GitHub, so to avoid introducing even more tools, let’s use GitHub Actions as the CI server that will run these tests.

Head into the folder where you’ve been working on the code samples for this blog post series and make sure you’re on the main branch, with the latest code:

$ cd fundamentals-of-devops

$ git checkout main

$ git pull origin main

Next, create a new ch5 folder for this blog post’s code examples, and copy into ch5 the sample-app folder from Part 4, where you had a Node.js app with automated tests:

$ mkdir -p ch5

$ cp -r ch4/sample-app ch5/sample-app

With that done, create a new folder called .github/workflows in the root of your repo:

$ mkdir -p .github/workflows

$ cd .github/workflows

Inside the .github/workflows folder, create a file called app-tests.yml, with the contents shown in Example 90:

Example 90. A GitHub Actions workflow to run the sample app automated tests (.github/workflows/app-tests.yml)
name: Sample App Tests

on: push                                  # (1)

jobs:                                     # (2)
  sample_app_tests:                       # (3)
    name: "Run Tests Using Jest"
    runs-on: ubuntu-latest                # (4)
    steps:
      - uses: actions/checkout@v2         # (5)

      - name: Install dependencies        # (6)
        working-directory: ch5/sample-app
        run: npm install

      - name: Run tests                   # (7)
        working-directory: ch5/sample-app
        run: npm test

With GitHub Actions, you use YAML to define workflows, which are configurable automated processes that run one or more jobs in response to certain triggers. Here’s what the preceding workflow does:

(1) The on block is where you define the triggers that will cause this workflow to run. The preceding code configures this workflow to run every time you do a git push to this repo.
(2) The jobs block defines one or more jobs (automations) to run in this workflow. By default, jobs run concurrently, but you can also create dependencies between jobs so that they run sequentially, and pass data between jobs.
(3) This workflow defines just a single job, which will run the automated tests for the sample app.
(4) Each job runs on a certain type of runner, which is how you configure the hardware (CPU, memory) and software (operating system and dependencies) to use for the build. The preceding code uses the ubuntu-latest runner, which gives you the default hardware configuration (2 CPUs and 7GB of RAM, as of 2024) and a software configuration that has Ubuntu and a bunch of commonly used software engineering tools (including Node.js) pre-installed.
(5) Each job consists of a series of steps that are executed sequentially. The first step in this job runs an action via the uses keyword. This is one of the best features of GitHub Actions: you can share and reuse actions, including both public, open source actions (which you can discover in the GitHub Marketplace) and private, internal actions within your own organization. The preceding code uses the actions/checkout action to check out the code for your repo (roughly the equivalent of running git clone).
(6) The second step in this job uses the run keyword to execute shell commands. In particular, it runs npm install in the ch5/sample-app folder to install the sample app’s dependencies.
(7) The third step in this job uses the run keyword to execute npm test, which runs the sample app’s automated tests.

If all the steps succeed, the job will be marked as successful (green); if any step fails—e.g., npm test exits with a non-zero exit code because one of the tests fails—then the job will be marked as failed (red).

To try it out, first commit and push the sample app and workflow code to your repo:

$ git add ch5/sample-app .github/workflows/app-tests.yml

$ git commit -m "Add sample-app and workflow"

$ git push origin main

Next, create a new branch called test-workflow to see this workflow in action:

$ git checkout -b test-workflow

Make a change to the sample app to intentionally return some text other than "Hello, World!", as shown in Example 91:

Example 91. Change the sample app to return a different response (ch5/sample-app/app.js)
res.send('Fundamentals of DevOps!');

Commit and push these changes to the test-workflow branch:

$ git add ch5/sample-app/app.js

$ git commit -m "Change response text"

$ git push origin test-workflow

After running git push, the log output will show you the GitHub URL to open a pull request. Open that URL in your browser, fill out a title and description, and click "Create pull request." You should get a page that looks something like Figure 47:

Figure 47. Automated tests running in a pull request

At the bottom of the pull request, you should see the "Sample App Tests" workflow has run: and, uh oh, looks like there’s an error. Click the Details link to the right of the workflow to see what went wrong. You should get a page that looks like Figure 48:

Figure 48. Looking into the cause of the test failure

Aha! The automated test is still expecting the response text to be "Hello, World!" To fix this issue, update app.test.js to expect "Fundamentals of DevOps!" as a response, as shown in Example 92:

Example 92. Update the automated test to expect the new response text (ch5/sample-app/app.test.js)
expect(response.text).toBe('Fundamentals of DevOps!');

Commit and push these changes to the test-workflow branch:

$ git add ch5/sample-app/app.test.js

$ git commit -m "Update response text in test"

$ git push origin test-workflow

This will automatically update your open PR, and automatically re-run your tests. After a few seconds, if you go back to your browser and look at the PR, you should see the tests passing, as shown in Figure 49:

Figure 49. The automated tests should now be passing

Congrats, you now have a self-testing build that will automatically run your app’s tests after every commit, and show you the results in every PR. Merge the PR, and let’s move on to adding automated tests for the infrastructure code.

Get your hands dirty

Here are a few exercises you can try at home to get a better feel for running automated app tests in CI:

  • To help catch bugs, update the GitHub Actions workflow to run a JavaScript linter, such as JSLint or ESLint, after every commit.

  • To help keep your code consistently formatted, update the GitHub Actions workflow to run a code formatter, such as Prettier, after every commit.

  • Run both the linter and code formatter as a precommit hook, so these checks run on your own computer before you can make a commit. You may wish to use the pre-commit framework to manage your precommit hooks.

Machine User Credentials and Automatically-Provisioned Credentials

Now that you’ve seen how to configure a CI server to run the sample app’s automated tests, you may want to update the CI server to also run the infrastructure automated tests that you added in Part 4. In that blog post, you added two types of automated tests for your infrastructure code: static analysis with Terrascan and unit testing with OpenTofu’s test command. Since the latter type of test deploys real resources into a real AWS account, you will need to give your automated tests a way to authenticate to AWS.

This is a somewhat tricky problem. When a human being needs to authenticate to a machine, you can rely on that human memorizing some sort of secret, such as a password. But what do you do when a machine, such as a CI server, needs to authenticate to another machine? How can that machine "memorize" some sort of secret without leaking that secret to everyone else? You’ll learn various approaches to solve this problem in Part 8 [coming soon].

For now, all you need to know is that you should never use a real user’s credentials to solve this problem. That is, do not use your own IAM user credentials, or your own GitHub personal access token, or any type of credentials from any human being in a CI server or other types of automation. Here’s why:

Departures

Typically, when someone leaves a company, you revoke all their access. If you were using their credentials for automation, then that automation will suddenly break.

Permissions

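The permissions that a human user needs are typically different from those a machine user needs, so sharing one set of credentials gives at least one of them more access than it should have.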

Audit logs

Most systems maintain an audit log that records who performed what actions in that system. These sorts of logs are useful for debugging and investigating security incidents—unless the same user account is used both by a human and automation, in which case, it’s harder to tell who did what.

Management

You typically want multiple developers at your company to be able to manage the automations you set up. If you use a single developer’s credentials for those automations, then the other developers won’t be able to access that user account if they need to update the credentials or permissions.

So if you can’t use the credentials of a real user, what do you do? These days, there are two main options: machine user credentials and automatically-provisioned credentials. These are the topics of the next two sections.

Key takeaway #4

Use machine user credentials or automatically-provisioned credentials to authenticate from a CI server or other automations.

Machine user credentials

One way to allow automated tools to authenticate is to create a dedicated machine user, which is a user account that is only used for automation (not by any human user). You create the user, generate credentials for them (e.g., access keys), and copy those credentials into whatever tool you’re using (e.g., into GitHub).

Machine users have a number of advantages: they never depart your company; you can assign them just the permissions they need; no human ever logs in as a machine user, so they only show up in audit logs when used in your automations; and you can share access to a single machine user account across your team by using a secrets management tool (a topic you’ll learn more about in Part 8 [coming soon]).

However, machine users also have some drawbacks: one drawback is that you have to copy their credentials around manually, which is tedious and error-prone; another drawback is that the credentials you’re copying are typically long-lived credentials, which don’t expire for a long time (if ever), so if these credentials are ever leaked, there is a long or even indefinite window of time during which they could be exploited.

With some tools, machine users are the best you can do, but these days, some systems support automatically-provisioned credentials, as described in the next section.

Automatically-provisioned credentials

A second way to allow automated tools to authenticate is to use automatically-provisioned credentials, which are credentials the system can generate automatically, without any need for you to manually create machine users or copy/paste credentials. This requires that the system you’re authenticating from (e.g., a CI server) and the system you’re authenticating to (e.g., AWS) have an integration between them that supports automatically-provisioned credentials.

You’ve actually seen one form of automatically-provisioned credentials already earlier in this blog post series: IAM roles. Some of the resources you’ve deployed in AWS, such as the EKS cluster in Part 3, used IAM roles to authenticate and make API calls within your AWS account (e.g., to deploy EC2 instances as EKS worker nodes). You didn’t have to create a machine user or manually manage credentials to make this work: instead, under the hood, AWS automatically provisioned credentials that the EKS cluster could use.

With IAM roles, the thing you’re authenticating from and the thing you’re authenticating to are both AWS, but there are also systems that support automatically-provisioned credentials across different companies and services. One of the most common is OpenID Connect (OIDC), which is an open protocol for authentication. Not all services support OIDC, so it’s not always an option, but in the cases where it is supported, it’s usually a more secure choice than machine user credentials, as OIDC gives you not only automatically-provisioned credentials (so no manual copy/paste), but also short-lived credentials that expire after a configurable period of time (e.g., one hour).

One place where OIDC is supported is between AWS and GitHub. To set up OIDC with AWS and GitHub, you configure your AWS account to trust an identity provider (IdP), such as GitHub, whose identity AWS can verify cryptographically (using a fingerprint you provide for GitHub), and then you can grant that provider permissions to assume specific IAM roles, subject to certain conditions: e.g., you can only use this IAM role from certain repos or branches.

Once you’ve set that up, Figure 50 shows the workflow for using OIDC to authenticate from GitHub to AWS:

Figure 50. With OIDC, you configure AWS to trust an IdP such as GitHub, which allows that IdP to exchange an OIDC token for short-lived AWS credentials

Here are the steps in the workflow:

  1. [GitHub] Generate an OIDC token: Inside a GitHub Actions workflow, GitHub generates an OIDC token, which is a JSON Web Token: a JSON object that contains claims—that is, data that GitHub is asserting—and a signature that can be cryptographically verified to prove the token really comes from GitHub. GitHub includes several claims, including information about what repo and branch the workflow is running in.

  2. [GitHub] Call the AssumeRoleWithWebIdentity API: The workflow then calls the AWS AssumeRoleWithWebIdentity API, specifying an IAM Role to assume, and passing the OIDC token to AWS as authentication.

  3. [AWS] Validate the OIDC token: AWS first validates the signature on the token to make sure it really came from GitHub, using a thumbprint you provide when setting up GitHub as an OpenID provider.

  4. [AWS] Validate IAM role conditions: Next, AWS validates the conditions on the IAM role against the claims in the token, especially whether that particular repo and branch is allowed to assume the IAM role.

  5. [AWS] Grant short-lived AWS credentials: If all the validations pass, AWS generates temporary AWS credentials that give you access to the IAM role’s permissions for a short period of time, and sends those back to GitHub.

  6. [GitHub] Use the AWS credentials: Finally, the tools in your GitHub Actions workflow, such as OpenTofu, can use the AWS credentials to authenticate to AWS and make changes in your AWS account.
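
To make steps 1 and 4 more concrete, here is a minimal Python sketch of what "validating the conditions against the claims" might look like. It is an illustration under stated assumptions, not how AWS implements it: the token is a fake, unsigned one built locally, the signature check from step 3 is skipped entirely, the repo name example-org/example-repo is hypothetical, and the helper names (decode_claims, role_conditions_allow) are made up for this example. The shape of the sub claim (repo:OWNER/REPO:ref:refs/heads/BRANCH) and the sts.amazonaws.com audience follow what GitHub Actions typically sends to AWS.

```python
import base64
import fnmatch
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_claims(token: str) -> dict:
    """Return the claims (payload) of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def role_conditions_allow(claims: dict, audience: str, sub_pattern: str) -> bool:
    """Mimic step 4: compare the token's claims to the IAM role's conditions."""
    return claims.get("aud") == audience and fnmatch.fnmatchcase(
        claims.get("sub", ""), sub_pattern
    )


# Build a fake, unsigned token shaped like the one GitHub generates in step 1.
header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url(json.dumps({
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:example-org/example-repo:ref:refs/heads/main",  # hypothetical repo
}).encode())
token = f"{header}.{payload}."

claims = decode_claims(token)
allowed = role_conditions_allow(claims, "sts.amazonaws.com", "repo:example-org/example-repo:*")
denied = role_conditions_allow(claims, "sts.amazonaws.com", "repo:other-org/other-repo:*")
print(allowed, denied)
```

The wildcard pattern plays the role of a condition that allows every branch of one repo; a stricter pattern that ends in ref:refs/heads/main would allow only the main branch.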

Since OIDC is a more secure option than machine user credentials, let’s try it out in the next section.

Example: Configure OIDC with AWS and GitHub Actions

Let’s set up an OIDC provider and IAM roles so that the automated tests you wrote for the lambda-sample OpenTofu module in Part 4 can authenticate to AWS from GitHub Actions. The first step is to set up the OIDC provider. The blog post series’s sample code repo includes an OpenTofu module called github-aws-oidc in the ch5/tofu/modules/github-aws-oidc folder that you can use to configure GitHub as an OIDC provider.

Switch back to the main branch, pull down the latest changes (i.e., the PR you just merged), and create a new branch called opentofu-tests:

$ git checkout main
$ git pull origin main
$ git checkout -b opentofu-tests

Next, create a new folder for a root module called ci-cd-permissions:

$ mkdir -p ch5/tofu/live/ci-cd-permissions
$ cd ch5/tofu/live/ci-cd-permissions

In the ci-cd-permissions folder, create main.tf with the initial contents shown in Example 93:

Example 93. Configure the github-aws-oidc module (ch5/tofu/live/ci-cd-permissions/main.tf)
provider "aws" {
  region = "us-east-2"
}

module "oidc_provider" {
  source = "github.com/brikis98/devops-book//ch5/tofu/modules/github-aws-oidc"

  provider_url = "https://token.actions.githubusercontent.com" # (1)
}

This code sets the following parameters:

(1) provider_url: The URL of the IdP. The preceding code sets this to the URL GitHub uses for OIDC. The github-aws-oidc module will also use this URL to fetch GitHub’s fingerprint, which AWS will use to cryptographically validate OIDC tokens.

In addition to the OIDC provider, you also need to create an IAM role that you can assume from GitHub Actions (using OIDC) for testing. The blog post series’s sample code repo has a module for that too: it’s called gh-actions-iam-roles, it lives in the ch5/tofu/modules/gh-actions-iam-roles folder, and it knows how to create several IAM roles for CI/CD with GitHub Actions. Example 94 shows how to update your ci-cd-permissions module to make use of the gh-actions-iam-roles module:

Example 94. Configure the gh-actions-iam-roles module (ch5/tofu/live/ci-cd-permissions/main.tf)
module "oidc_provider" {
  # ... (other params omitted) ...
}

module "iam_roles" {
  source = "github.com/brikis98/devops-book//ch5/tofu/modules/gh-actions-iam-roles"

  name              = "lambda-sample"                           # (1)
  oidc_provider_arn = module.oidc_provider.oidc_provider_arn    # (2)

  enable_iam_role_for_testing = true                            # (3)

  # TODO: fill in your own repo name here!
  github_repo      = "brikis98/fundamentals-of-devops-examples" # (4)
  lambda_base_name = "lambda-sample"                            # (5)
}

This code configures the following parameters:

(1) name: The base name for the IAM roles and all other resources created by this module. The preceding code sets this to "lambda-sample," so the IAM role for testing will be called "lambda-sample-tests."
(2) oidc_provider_arn: Specify the OIDC provider that will be allowed to assume the IAM roles created by this module. The preceding code sets this to the OIDC provider you just created using the github-aws-oidc module. Under the hood, the gh-actions-iam-roles module will configure the trust policy in the IAM roles to trust this OIDC provider and allow it to assume the IAM roles.
(3) enable_iam_role_for_testing: If set to true, create the IAM role specifically for automated testing. You’ll see the other IAM roles this module can create later in this blog post.
(4) github_repo: The GitHub repo that will be allowed to assume the IAM roles using OIDC. You will need to fill in your own GitHub repo name here. Under the hood, the gh-actions-iam-roles module sets certain conditions in the trust policies of each IAM role to specify which repos and branches in GitHub are allowed to assume that IAM role. For the testing IAM role, all branches in the specified repo will be allowed to assume the IAM role.
(5) lambda_base_name: The base name you use for the lambda-sample module and all the resources it creates. This should be the same value you use for the name parameter in that module. This is necessary so the gh-actions-iam-roles module can create IAM roles that only have permissions to manage the lambda-sample resources, and no other resources.
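
If you’re curious what a trust policy with those conditions might look like, the following Python sketch builds one and prints it as JSON. Treat it as an assumption about the general shape of such a policy, not as the exact policy the gh-actions-iam-roles module generates: the build_trust_policy function is a hypothetical helper, and the account ID is the same placeholder used elsewhere in this post. The condition keys are the GitHub OIDC host followed by :aud and :sub, which is how AWS exposes the token’s claims to IAM policies.

```python
import json


def build_trust_policy(oidc_provider_arn: str, github_repo: str) -> dict:
    """Build a trust policy that lets any branch of one GitHub repo assume a role."""
    provider_host = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Trust the OIDC provider rather than an IAM user or AWS service.
            "Principal": {"Federated": oidc_provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # The token must have been issued for use with AWS STS...
                "StringEquals": {f"{provider_host}:aud": "sts.amazonaws.com"},
                # ...from a workflow in this repo (the trailing * allows any branch).
                "StringLike": {f"{provider_host}:sub": f"repo:{github_repo}:*"},
            },
        }],
    }


policy = build_trust_policy(
    "arn:aws:iam::111111111111:oidc-provider/token.actions.githubusercontent.com",
    "brikis98/fundamentals-of-devops-examples",
)
print(json.dumps(policy, indent=2))
```

Restricting a role to a single branch would be a matter of replacing the trailing wildcard with something like ref:refs/heads/main.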

You should also create a file called outputs.tf that outputs the testing IAM role ARN, as shown in Example 95:

Example 95. The output variables for the ci-cd-permissions module (ch5/tofu/live/ci-cd-permissions/outputs.tf)
output "lambda_test_role_arn" {
  value = module.iam_roles.lambda_test_role_arn
}

Deploy this module as usual: authenticate to AWS as described in Authenticating to AWS on the command line, and run init and apply:

$ tofu init
$ tofu apply

After apply completes, you should see an output variable:

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

lambda_test_role_arn = "arn:aws:iam::111111111111:role/lambda-sample-tests"

Take note of the lambda_test_role_arn output value, as you’ll need it soon.

Now that the OIDC provider and IAM role are in place, you can finally set up the automated tests for your infrastructure code.

Example: Run Automated Tests for Infrastructure in GitHub Actions

To run the automated tests for your infrastructure code in GitHub Actions, you first need the infrastructure code itself. Copy over the lambda-sample module that had automated tests from Part 4, as well as the test-endpoint module that those tests used under the hood:

$ cd fundamentals-of-devops
$ mkdir -p ch5/tofu/modules
$ cp -r ch4/tofu/live/lambda-sample ch5/tofu/live
$ cp -r ch4/tofu/modules/test-endpoint ch5/tofu/modules

Now you have the code to test, but you should make some changes to it before running those tests in a CI environment. In a CI environment, you may have many tests running concurrently, which is a good thing, as it can help reduce test times. However, the lambda-sample module currently hard-codes the names of all of its resources (e.g., it hard-codes the name of the Lambda function, the IAM role, and so on), so if several developers are running that test concurrently in CI, you’ll get errors due to name conflicts, as AWS requires Lambda function and IAM role names to be unique.

To fix this issue, the first step is to add a variables.tf file to the lambda-sample module with the contents shown in Example 96:

Example 96. Define an input variable for the lambda-sample module (ch5/tofu/live/lambda-sample/variables.tf)
variable "name" {
  description = "The base name for the function and all other resources"
  type        = string
  default     = "lambda-sample"
}

This defines a name variable which you can use to namespace all the resources created by this module. The default value is "lambda-sample," which is exactly the value the module used before, so the default behavior doesn’t change, but by exposing this input variable, you’ll be able to override the value at test time.

Next, update main.tf to use var.name instead of any hard-coded names, as shown in Example 97:

Example 97. Update the lambda-sample module to use the name input variable instead of hard-coded names (ch5/tofu/live/lambda-sample/main.tf)
module "function" {
  # ... (other params omitted) ...

  name = var.name
}

module "gateway" {
  # ... (other params omitted) ...

  name = var.name
}

Now you can create a new workflow called infra-tests.yml in .github/workflows, with the initial contents shown in Example 98:

Example 98. The first half of a GitHub Actions workflow to run the infrastructure automated tests (.github/workflows/infra-tests.yml)
name: Infrastructure Tests

on: push

jobs:
  terrascan:
    name: "Run Terrascan"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Run Terrascan
        uses: tenable/terrascan-action@main
        with:
          iac_type: 'terraform'
          iac_dir: 'ch5/tofu/live/lambda-sample'
          verbose: true
          non_recursive: true
          config_path: 'ch5/tofu/live/lambda-sample/terrascan.toml'

This workflow, which runs on push, contains two jobs. The preceding code just shows the first job, which uses an open source action to install and run Terrascan, passing it the same parameters as when you ran it manually in Part 4.

Example 99 shows the second half of the workflow:

Example 99. The second half of a GitHub Actions workflow to run the infrastructure automated tests (.github/workflows/infra-tests.yml)
  opentofu_test:
    name: "Run OpenTofu tests"
    runs-on: ubuntu-latest
    permissions:                                                                # (1)
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v2

      - uses: aws-actions/configure-aws-credentials@v3                          # (2)
        with:
          # TODO: fill in your IAM role ARN!
          role-to-assume: arn:aws:iam::111111111111:role/lambda-sample-tests    # (3)
          role-session-name: tests-${{ github.run_number }}-${{ github.actor }} # (4)
          aws-region: us-east-2

      - uses: opentofu/setup-opentofu@v1                                        # (5)

      - name: Tofu Test
        env:
          TF_VAR_name: lambda-sample-${{ github.run_id }}                       # (6)
        working-directory: ch5/tofu/live/lambda-sample
        run: |                                                                  # (7)
          tofu init -backend=false -input=false
          tofu test -verbose

The second half of the workflow adds a job to run OpenTofu tests:

(1) By default, every GitHub Actions job gets contents: read permissions in your repo, which allows that job to check out the code in the repo. In order to use OIDC, you need to add the id-token: write permissions. This will allow you to issue an OIDC token for authenticating to AWS in (2).
(2) Use an open source action to authenticate to AWS using OIDC. This calls the AssumeRoleWithWebIdentity API to exchange the OIDC token for temporary AWS credentials.
(3) The IAM role to assume. Make sure to fill in the IAM role ARN from the lambda_test_role_arn output in the previous section.
(4) The name to use for the session when assuming the IAM role. This shows up in audit logging, so the preceding code includes useful information in the session name, such as the name of the tests, which run number this is in GitHub, and which GitHub user triggered the workflow.
(5) Use an open source action to install OpenTofu.
(6) Use the environment variable TF_VAR_name to set the name input variable of the lambda-sample module to a value that includes the GitHub Actions run ID, so it will be unique for each test run, and therefore, avoid problems with running multiple tests concurrently.
(7) Kick off the tests by running tofu init and tofu test. Note that the init command sets -backend=false to skip backend initialization. Later in this post, you’ll start using remote backends with the lambda-sample module, which is useful for deployment, but not something you want to enable at test time.

Add, commit, and push all the changes to the opentofu-tests branch, and then open a pull request. You should see something similar to Figure 51:

Figure 51. A PR showing the sample app unit tests, Terrascan, and OpenTofu tests running

Congrats, you should now have automated tests running for both your app code and your infrastructure code, as you can see at the bottom of the PR! After a minute or two, if everything is configured correctly, and the tests are passing, merge the PR.

Get your hands dirty

Here’s an exercise you can try at home to get a better feel for running automated infrastructure tests in CI:

  • To help keep your code consistently formatted, update the GitHub Actions workflow to run a code formatter, such as tofu fmt, after every commit.

You now have a self-testing build that runs your sample app automated tests, Terrascan, and OpenTofu tests after every commit. If you keep growing this suite of automated tests, and you regularly integrate changes from all of your developers, then your code will always be in a deployable state. But how do you actually do the deployments? That’s the topic of the next section.

Continuous Delivery (CD)

Continuous delivery (CD) is a software development practice where you ensure that you can deploy to production at any time in a manner that is fast, reliable, and sustainable. You could choose to deploy daily, several times a day, thousands of times per day, or even after every single commit that passes the automated tests; this last approach is known as continuous deployment. The key with CD is not how often you deploy, but to ensure that the frequency of deployment is purely a business decision—not something limited by your technology.

Key takeaway #5

Ensure you can deploy to production at any time in a manner that is fast, reliable, and sustainable.

If you’re used to a painful deploy process that happens only once every few weeks or months, then deploying many times per day may sound like a nightmare—and deploying thousands of times per day probably sounds utterly impossible. But this is yet another place where, if it hurts, you need to do it more often.

To make it possible to deploy more often—and more importantly, to make it possible to deploy any time you want—you typically need to fulfill two requirements:

  1. The code is always in a deployable state: You saw in the previous section that this is the key benefit of practicing CI. If everyone is integrating their work regularly, and you have a self-testing build with a sufficient suite of tests, then your code will always be ready to deploy.

  2. The deployment process is sufficiently automated: If you have a deployment process that involves many manual steps, then you can’t really practice CD, because manual deployments typically aren’t fast, reliable, or sustainable. CD requires that you automate your deployment process.

This section focuses on item (2), automating the deployment process. Managing your infrastructure as code, using the tools in Part 2, gets you a large part of the way there. To get the rest of the way there, you need to automate the process around using IaC. This includes implementing deployment strategies and a deployment pipeline, as discussed in the next two sections.

Deployment Strategies

There are many deployment strategies that you can use to roll out changes: some involve downtime, while others do not; some are easy to implement, while others are more complicated; some only work with stateless apps, which are apps that don’t need to persist any of the data that they store on their local hard drives (e.g., most web frontend apps are stateless), while others also work with stateful apps, which are apps that store data on their local hard disks that needs to be persisted across deployments (e.g., any sort of database or distributed data system).

The next several sections will go over some of the most common strategies used today, including a basic overview of each strategy, its advantages and drawbacks, and common use cases where the strategy is typically a good fit. Note that these strategies are not mutually exclusive: you can, and often do, combine multiple strategies together.

Downtime deployment

This is the most basic deployment strategy, where you take a downtime to roll out changes, as shown in Figure 52:

Figure 52. Downtime deployment
  1. You start with several replicas of v1 of your app running.

  2. You take all the v1 nodes down to update them to v2. While the update is happening, your users get an outage.

  3. Once the deployment is completed, you have v2 running everywhere, and your users are able to use the app again.

Advantages of this strategy:

Easy to implement

This is the simplest and most basic deployment strategy.

Works with all types of apps

You can use this strategy with both stateless apps and stateful apps.

Drawbacks of this strategy:

Downtime

Users have to suffer through an outage while you do the deployment.

Use cases where the strategy is typically a good fit:

Single-replica systems

If you have a system with only a single replica, then when you go to update that one replica, you may have no choice other than taking a downtime. As you learned in Part 3, this is one of many reasons to run more than a single replica of your app.

Data migrations

If you are doing a large data migration, doing it without downtime (which often requires multiple migration steps and replicating all writes across multiple systems) is often 10x more expensive and error-prone than doing it with a brief downtime.

Except for the handful of use cases I just mentioned, I would not recommend using downtime deployments, as these days, there is wide support for the zero-downtime deployment strategies discussed in the following sections.

Rolling deployment without replacement

This is the deployment strategy you saw in Part 3, where you gradually roll out new versions of your app onto new servers, and once the new versions of the app start to pass health checks, you gradually remove the old versions of the app, as shown in Figure 53:

Figure 53. Rolling deployment without replacement
  1. You start with several replicas of v1 of your app running.

  2. You start deploying v2 of your app onto new servers. Once the v2 apps come up and start passing health checks, the load balancer will send traffic to them, so for some period of time, users may see both v1 and v2 of your app.

  3. As the v2 apps start passing health checks, you gradually undeploy the v1 apps, until you end up with just v2 running.
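
The three steps above can be sketched as a small Python simulation. This is a toy model under simplifying assumptions (one server is added and one removed at a time, and every new server is assumed to pass its health checks); the function name rolling_without_replacement is made up for this example and doesn’t correspond to any deployment tool’s API.

```python
def rolling_without_replacement(replicas: int, old: str = "v1", new: str = "v2"):
    """Yield the list of serving versions after each step of the rollout."""
    serving = [old] * replicas
    yield list(serving)
    for _ in range(replicas):
        serving.append(new)  # boot a new server; assume it passes health checks
        yield list(serving)
        serving.remove(old)  # only then undeploy one of the old servers
        yield list(serving)


states = list(rolling_without_replacement(3))
for state in states:
    print(state)
```

Two properties of the strategy show up in the printed states: the number of serving replicas never drops below the starting count (no downtime), and several intermediate states contain both v1 and v2 (users may see either version).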

Advantages of this strategy:

No downtime

Your app keeps working for your users during deployments.

Widely supported

Most deployment tools natively support rolling deployments without replacement.

Drawbacks of this strategy:

Poor UX

During a rolling deployment, users may see both the old and new versions of your app at the same time, which can be a jarring user experience, or even cause bugs if you’re not careful.

Works only with stateless apps

This version of rolling deployment doesn’t work with stateful apps, as you deploy the v2 replicas before taking down the v1 replicas, so those v1 replicas are still using their hard-drives. In the next section, you’ll see a version of rolling deployment that does work with stateful apps.

Use cases where the strategy is typically a good fit:

Deploying stateless apps

For stateless apps, a rolling deployment without replacement can be an effective option.

As you’ll see a little later in this post, blue-green deployments are typically a better choice for stateless apps. However, relatively few systems support blue-green deployments natively, whereas rolling deployment without replacement is widely supported (you saw several examples of it in Part 3), so it’s often your best bet for stateless apps.

Rolling deployment with replacement

This is nearly identical to the rolling deployment in the previous section, except here, you remove the old version of the app before booting up the new version, as shown in Figure 54:

Figure 54. Rolling deployment with replacement
  1. You start with several replicas of v1 of your app running, each with a hard-drive attached. These are typically network-attached hard-drives.

  2. You disconnect one[28] v1 replica from the load balancer, shut down the server, and move its hard drive to a new v2 server (since it’s a network-attached hard-drive, you do the move through software). Once that new v2 server starts passing health checks, the load balancer starts sending traffic to it.

  3. You repeat this process with each v1 server, taking it out of the load balancer rotation, shutting it down, and moving its hard-drive to a new v2 server, until all replicas are replaced with v2.
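
Here is a minimal Python sketch of the same process, to highlight the one thing that differs from the previous strategy: each replica’s disk outlives the server it was attached to. The Server class, the disk IDs, and the rolling_with_replacement function are all hypothetical, and the sketch assumes every new server passes its health checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    version: str
    disk: str  # ID of the network-attached disk holding this replica's data


def rolling_with_replacement(servers: List[Server], new_version: str) -> List[List[Server]]:
    """Return the servers in load balancer rotation after each replacement."""
    in_rotation = list(servers)
    history = []
    for i, old in enumerate(servers):
        # The old server leaves the rotation and shuts down first, which frees
        # its disk (so capacity briefly drops by one replica).
        freed_disk = old.disk
        # The same disk is attached to a new server running the new version;
        # this sketch assumes it passes health checks and rejoins the rotation.
        in_rotation[i] = Server(version=new_version, disk=freed_disk)
        history.append(list(in_rotation))
    return history


fleet = [Server("v1", "disk-a"), Server("v1", "disk-b"), Server("v1", "disk-c")]
history = rolling_with_replacement(fleet, "v2")
for step in history:
    print([(s.version, s.disk) for s in step])
```

At the end, every server runs v2, but the set of disks (and therefore the data) is exactly the one you started with.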

Advantages of this strategy:

No downtime

Your app keeps working for your users during deployments.

Works with all types of apps

You can use this strategy with both stateless and stateful apps.

Widely supported

Most deployment tools natively support rolling deployments with replacement.

Drawbacks of this strategy:

Limited support for hard-drive replacement

While most deployment tools support rolling deployment with replacement, only a small subset of those tools natively support moving hard-drives over while the deployment is happening.

Poor UX

During a rolling deployment, users may see both the old and new versions of your app at the same time, which can be a jarring user experience, or even cause bugs if you’re not careful.

Use cases where the strategy is typically a good fit:

Deploying stateful apps

This strategy allows you to do zero-downtime deployments for stateful systems where each replica has a unique set of data on its local hard-drive that needs to be persisted across the deployment (e.g., distributed data stores such as Consul, Elasticsearch, and ZooKeeper).

When it comes to stateful apps, rolling deployment with replacement is the gold standard. However, for stateless apps, blue-green deployment is the gold standard, as discussed in the next section.

Blue-green deployment

With blue-green deployments, you bring up the new (green) version of your app, wait for it to be fully ready, and then instantaneously switch all traffic from the old version (blue) to the new version (green), as shown in Figure 55:

Figure 55. Blue-green deployment
  1. You start with several replicas of v1 of your app running. Let’s refer to these v1 apps as "blue."

  2. You start deploying v2 of your app, which we’ll refer to as "green," onto new servers. The v2 apps start to go through health checks in the load balancer, but the load balancer does not yet send any traffic to them.

  3. When all the v2 replicas are passing health checks, you do an instantaneous switchover, moving all traffic from v1 (blue) to v2 (green). At that point, you undeploy all the v1 servers, leaving just v2.
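
As a rough illustration of the switchover in step 3, the following Python sketch models a load balancer that sends all traffic to exactly one pool of servers at a time. The LoadBalancer class and its methods are invented for this example (real load balancers expose this differently), and the health check is a stand-in function that always passes.

```python
class LoadBalancer:
    """Toy load balancer that routes all traffic to one named pool at a time."""

    def __init__(self, pools: dict, active: str):
        self.pools = pools
        self.active = active

    def route(self) -> list:
        """Return the replicas currently receiving traffic."""
        return self.pools[self.active]

    def switch_to(self, name: str, healthy) -> bool:
        """Flip traffic to pool `name`, but only if every replica in it is healthy."""
        if not all(healthy(replica) for replica in self.pools[name]):
            return False
        self.active = name  # one assignment: the "instantaneous" switchover
        return True


lb = LoadBalancer({"blue": ["v1", "v1", "v1"]}, active="blue")
lb.pools["green"] = ["v2", "v2", "v2"]  # step 2: deploy green; it gets no traffic yet
before = lb.route()
switched = lb.switch_to("green", healthy=lambda replica: True)  # step 3
after = lb.route()
del lb.pools["blue"]  # undeploy the old (blue) servers
print(before, switched, after)
```

Unlike the rolling strategies, there is no state in which route() returns a mix of v1 and v2, which is where the better user experience comes from.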

Advantages of this strategy:

No downtime

Your app keeps working for your users during deployments.

Good UX

During a deployment, your users see only one version of the app or the other, but not both, so you avoid a jarring user experience and potential bugs.

Drawbacks of this strategy:

Limited support

Only a small subset of tools natively support blue-green deployments.

Works only with stateless apps

With stateful apps, you typically can’t instantaneously move data over to the new version, so blue-green deployments aren’t an option (without downtime).

Use cases where the strategy is typically a good fit:

Deploying stateless apps

Blue-green deployments are the gold standard for deploying stateless apps.

All the deployment strategies you’ve seen so far can be used standalone. Let’s now turn our attention to some strategies that are meant to be combined with other strategies, starting with canary deployment.

Canary deployment

This is not a standalone deployment strategy, but a strategy meant to be combined with other strategies, such as rolling deployment or blue-green deployment, to reduce the risk of broken deployments by testing new code on a single replica before doing a full rollout.

The name "canary" comes from the proverbial "canary in the coal mine," which is a bird that coal miners would take into mines with them, as canaries are more sensitive to poisonous gasses than humans, so if the canary starts reacting poorly or dies, it’s an early warning signal that you need to get out immediately. The idea with canary deployments is similar: you initially deploy your new code on only a single replica, and if that replica shows any problems, you roll back the deployment before it can cause more damage, as shown in Figure 56:

Figure 56. Canary deployment
  1. You start with several replicas of v1 of your app running.

  2. You deploy a single replica of v2, called the canary server, and send traffic to it. You then compare the canary server to a randomly-chosen older (v1) server, called the control. If you see any differences—e.g., the canary has higher error rates or higher memory usage than the control—this gives you an early warning that the deployment has problems, and you can roll it back before it does too much damage.

  3. If you can’t find any differences between the canary and the control, then you can roll out v2 fully using one of the other strategies, such as rolling deployment or blue-green deployment.
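
The canary-versus-control comparison in step 2 can be sketched in a few lines of Python. This is a deliberately simple illustration: the metric names, the sample values, and the 10% tolerance are all assumptions made for this example, and real canary analysis typically looks at many more metrics over a longer window.

```python
def canary_verdict(canary: dict, control: dict, tolerance: float = 0.10) -> str:
    """Return "rollback" if any canary metric exceeds the control's by more than tolerance."""
    for metric, control_value in control.items():
        limit = control_value * (1 + tolerance)
        if canary[metric] > limit:
            return "rollback"
    return "promote"


# Hypothetical metrics gathered from the control (v1) server.
control = {"error_rate": 0.010, "memory_mb": 512.0}

healthy_canary = canary_verdict({"error_rate": 0.0105, "memory_mb": 520.0}, control)
broken_canary = canary_verdict({"error_rate": 0.030, "memory_mb": 515.0}, control)
print(healthy_canary, broken_canary)
```

A "promote" verdict corresponds to step 3 (continue with a rolling or blue-green deployment), while "rollback" means removing the canary before more users are affected.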

Advantages of this strategy:

Catch errors early

Before they affect too many of your users.

Drawbacks of this strategy:

Poor UX

During a canary deployment, a small percentage of your users may see both the old and new versions of your app at the same time, which can be a jarring user experience, or even cause bugs if you’re not careful.

Use cases where the strategy is typically a good fit:

Large deployments

Where even a small percentage of traffic can give you meaningful data.

Risky deployments

Where a full-scale outage would cause significant problems for your business.

Canary deployments offer one way to reduce the blast radius if a deployment goes wrong. If you combine canary deployments with feature toggle deployments, which are discussed in the next section, you can reduce the risk of deployments even further.

Feature toggle deployment

You saw feature toggles earlier in Section 5.1.3.2 as a technique for being able to merge code into main regularly, even while making large-scale changes. It turns out that feature toggles can also have a profound impact on how you deploy software. This is also not a standalone deployment strategy, but a strategy meant to be combined with other strategies, such as rolling deployment or blue-green deployment. Figure 57 shows an overview of feature toggle deployment:

Figure 57. Feature toggle deployment
  1. You start with several replicas of v1 of your app running.

  2. You deploy v2 of your app using one of the other strategies, such as rolling deployments or blue-green deployments, but with a key difference: any new features in the new version are wrapped in a feature toggle—and off by default. Therefore, the deployment itself doesn’t release any new functionality: that is, users won’t see any differences as a result of v2 being deployed.

  3. After the deployment is done, you can then enable v2 using your feature toggle service, and only then will users start to see different functionality.
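
To see why deploying v2 doesn’t change anything for users until you flip the toggle, consider this minimal Python sketch. The FeatureToggles class is a toy, in-memory stand-in for a real feature toggle service, and the toggle name new-homepage and the render_homepage function are hypothetical.

```python
class FeatureToggles:
    """Toy in-memory stand-in for a feature toggle service."""

    def __init__(self):
        self._enabled = set()

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled  # unknown toggles are off by default

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)


def render_homepage(toggles: FeatureToggles) -> str:
    """v2 of the app: the new code is deployed, but wrapped in a toggle."""
    if toggles.is_enabled("new-homepage"):  # hypothetical toggle name
        return "new homepage"
    return "old homepage"


toggles = FeatureToggles()
after_deploy = render_homepage(toggles)    # step 2: v2 deployed, toggle still off
toggles.enable("new-homepage")
after_release = render_homepage(toggles)   # step 3: release by flipping the toggle
toggles.disable("new-homepage")
after_rollback = render_homepage(toggles)  # turn the feature off with no new deploy
print(after_deploy, after_release, after_rollback, sep=" | ")
```

The same deployed code produces different behavior depending only on the toggle, which is what lets you separate deployment from release.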

Advantages of this strategy:

Separate deployment from release

Without feature toggles, every time you deploy new code (e.g., roll out a new Docker image into a Kubernetes cluster), you also automatically release every single new feature in that code, all at once. With feature toggles, the deployment and release steps are now separate, which makes deployments considerably less risky. This is another one of the key ingredients that makes it possible for the companies with world-class software delivery processes mentioned in Section 1.1 to deploy thousands of times per day.

Resolve issues without deploying new code

Not only do feature toggles allow you to release features separately from deploying new code, but they also allow you to unrelease features without code changes. That is, if you enable a feature toggle, and you start seeing problems (bugs, performance issues, outages), you can just as quickly disable that feature toggle to turn the feature off. In many cases, this gives you a way to resolve issues that is much faster than having to write and deploy new code. It’s one of the big reasons the companies mentioned in Section 1.1 can recover from downtime 700-4000x faster.

Ramp new features

A remarkable benefit of separating deployment from release is that it allows you to ramp features gradually, rather than turning them on for all users at once. For example, at LinkedIn, one of the changes from Project Inversion was to require all new features to be wrapped in feature toggles, and to ramp them up gradually; Facebook, Google, and many other companies use similar processes. Every new feature started off disabled by default, and when it was ready for testing, we’d first turn it on only for employees, so that we could test it internally; if you work at companies like LinkedIn, Facebook, or Google, your experience of those products can be very different from that of the general public. Once things were looking good in internal testing, we’d ramp the feature up, turning it on for, say, a random 1% of users. We’d then observe those users, looking at their error rates to make sure there were no problems. If everything looked OK, we’d ramp the feature to 10% of users. After another round of observation, we’d ramp to 50%, and eventually to 100%. If we hit issues at any point, we could pause the ramp, or ramp back down.

A/B test features

Feature toggles also give you the ability to do A/B testing (AKA bucket testing), where you compare how different versions of your product perform against each other. For example, you could randomly split your users into two buckets, a bucket A with the new feature enabled, and a bucket B with the new feature disabled, and compare how the two buckets perform on key metrics: did the new feature increase engagement? Downloads? Purchases? Referrals? This is just like a scientific experiment, with control and experimental groups: as long as (a) you randomly assign users to buckets, (b) the only difference between the buckets is the new feature, and (c) you gather enough data for it to be statistically significant,[29] then you can be reasonably confident that any difference in metrics between the buckets is due to the new feature. In other words, you are using data to establish a causal relationship!

This is sometimes called data-driven product development, and if you have the type of product where you can do it (i.e., you can show users different versions of the product, and you have sufficient traffic to generate statistically significant results), it can be transformational.[30]
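To make "statistically significant" a bit more concrete, here is a rough Python sketch of one common check, a two-proportion z-test, applied to conversion counts from the two buckets. The choice of test is an assumption on my part (this post doesn't prescribe one), the function name is hypothetical, and the numbers are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that the feature makes no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal curve
    return z, p_value


# Made-up data: bucket A (feature on) vs. bucket B (feature off).
z, p = two_proportion_z_test(conv_a=1200, n_a=10000, conv_b=1000, n_b=10000)
print(f"z = {z:.2f}, p = {p:.6f}")
```

With the same conversion rates but only 100 users per bucket, the p-value comes out far above the usual 0.05 threshold, which is why you need sufficient traffic before drawing conclusions.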

Drawbacks of this strategy:

Requires an extra service

To use feature toggles, you have to run and maintain an extra feature toggle service, or pay for one from a 3rd party.

Forked code

Over time, as you add more and more if-statements with feature toggle lookups, you get more and more forks in your code. This makes the code harder to maintain and test. If you’re going to use feature toggles, you’ll need to create the discipline (and automation) to ensure that you systematically remove if-statements for feature toggles that are unlikely to ever change again (e.g., feature toggles greater than 1 year old).
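As a sketch of the kind of automation that can help with this cleanup, the following Python snippet flags toggles that haven't changed in over a year. It assumes you can export, from your feature toggle service, a mapping of toggle names to the date each was last changed; the stale_toggles function and the sample data are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_toggles(last_changed: Dict[str, date], today: date,
                  max_age_days: int = 365) -> List[str]:
    """Return the names of toggles whose last change is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, changed in last_changed.items() if changed < cutoff)


# Hypothetical export from a feature toggle service.
toggles = {
    "new-homepage": date(2023, 1, 15),
    "checkout-v2": date(2024, 5, 1),
}
for name in stale_toggles(toggles, today=date(2024, 6, 25)):
    print(f"Toggle '{name}' is stale: remove its if-statements")
```

You could run something like this on a schedule and open a ticket for each stale toggle it finds.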

Use cases where the strategy is typically a good fit:

All new feature development

The ability to separate deployment from release, carefully ramp new features, and quickly shut off features that are causing issues is such a huge advantage in agility, that once you get past a certain scale as a company, you should consider wrapping all new features in feature toggles.

Data-driven development

Feature toggles are an incredibly powerful tool for product teams, as they give you the ability to do A/B testing and data-driven development.

If you’re paged at 3AM because of an outage, the ability to disable the feature causing the outage in a few clicks, so you can all go back to sleep and put in a more permanent fix during normal working hours, truly feels like a superpower. Almost all companies that have world-class software delivery processes make heavy use of feature toggles. Most of these companies also promote changes from environment to environment, as discussed next.

Promotion deployment

This is yet another strategy that isn’t a standalone strategy, but meant to be combined with other strategies, such as rolling deployment or blue-green deployment. The idea with promotion deployments (AKA promotion workflows) is to deploy your code across multiple environments, starting with internal pre-production environments, and ending up in your production environment, with the hope that you can catch issues in the pre-production environments before they affect production (you’ll learn more about multiple environments in Part 6 [coming soon]), as shown in Figure 58:

Figure 58. Promotion deployment
  1. Let’s say you have three environments: dev, stage, and prod. Initially, v1 of your app is running in all three of those environments.

  2. You use one of the other deployment strategies (e.g., rolling deployment or blue-green deployment) to deploy v2 across the dev environment, and do a round of testing in dev.

  3. If everything works well in dev, you deploy exactly the same v2 code—also known as promoting v2—to the stage environment, and do a round of testing in stage.

  4. If everything works well in stage, you finally promote v2 to prod.
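The steps above boil down to a simple loop: deploy to an environment, test, and only move on if the tests pass. Here is a minimal Python sketch of that loop; the promote function and the deploy and verify callbacks are hypothetical stand-ins for whatever deployment strategy and tests you actually use.

```python
from typing import Callable, List


def promote(version: str, environments: List[str],
            deploy: Callable[[str, str], None],
            verify: Callable[[str, str], bool]) -> List[str]:
    """Deploy version to each environment in order, stopping at the first failure.

    Returns the environments where the version was deployed and passed testing.
    """
    passed = []
    for env in environments:
        deploy(env, version)          # e.g., a rolling or blue-green deployment
        if not verify(env, version):  # e.g., automated and manual tests
            break                     # never promote a failing version further
        passed.append(env)
    return passed


def example_deploy(env: str, version: str) -> None:
    print(f"Deploying {version} to {env}")


def example_verify(env: str, version: str) -> bool:
    return True  # placeholder: pretend all tests pass


print(promote("v2", ["dev", "stage", "prod"], example_deploy, example_verify))
```

The important property is that prod only ever receives a version that has already been deployed and tested in every earlier environment.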

Advantages of this strategy:

Multiple chances to catch errors

You get a chance to test your code in pre-prod environments before that exact same code goes to prod.

Drawbacks of this strategy:

Requires multiple environments

You have to deploy and maintain multiple environments, instead of just one.

Use cases where the strategy is typically a good fit:

All deployments

The benefits of having pre-prod environments to test in are so significant, that once you get past a certain scale as a company, you should consider using multiple environments and promotion workflows for all deployments.

If you use multiple environments as a company (e.g., dev, stage, prod), something you’ll learn more about in Part 6 [coming soon], you should almost certainly use promotion workflows as well. Moreover, if you manage your infrastructure as code, promotion workflows are essential for automating infrastructure deployments, as discussed in the next section.

Infrastructure deployment

Except for promotion workflows, just about all the deployment strategies in the previous sections are only applicable to deploying application code: e.g., apps written in Java, Ruby, Python, JavaScript, etc. When it comes to infrastructure code (e.g., OpenTofu, Pulumi, CloudFormation), the deployment strategies that are available are much more limited. Typically, it’s binary: either you make an infrastructure change, or you don’t; either you create (or delete!) that database, or you don’t; there’s no gradual rollout, no feature toggles, no canaries, etc. That makes infrastructure deployments harder and riskier. The typical strategy used to mitigate those risks comes down to the following two steps:

  1. Validate plan output: Assuming your infrastructure tool supports some sort of plan or dry-run operation, you should always analyze the plan output before deploying changes to an environment. For example, with OpenTofu, you can integrate running the plan command into your pull request workflow, so you can review not only the code changes, but also the plan output, before merging changes in. You’ll see an example of this later in this blog post.

  2. Use a promotion workflow: Promote infrastructure changes from environment to environment, just as you saw in the previous section. For example, you deploy the same infrastructure code first in dev, then in stage, and then in prod, with a period of testing in each environment before moving on to the next one.
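Part of the first step, validating plan output, can be automated. If you save a plan with tofu plan -out=FILE and convert it with tofu show -json FILE, you get machine-readable JSON that includes a resource_changes list, where each entry has an address and a list of change actions. The Python sketch below flags any resource the plan would delete (including replacements, which show up as a delete plus a create). The destructive_changes function name and the sample plan are hypothetical, and this complements a human review rather than replacing it.

```python
import json
from typing import List


def destructive_changes(plan: dict) -> List[str]:
    """Return the addresses of resources that the plan would delete or replace."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


# Hypothetical, heavily trimmed example of JSON plan output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_lambda_function.function", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.db", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.role", "change": {"actions": ["no-op"]}}
  ]
}
""")

for address in destructive_changes(sample_plan):
    print(f"WARNING: plan would delete or replace {address}")
```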

Advantages of this strategy:

Works with infrastructure deployments

The strategies in this section work with most types of infrastructure changes.

Even more chances to catch errors

You not only get a chance to test your code in pre-prod environments before that exact same code goes to prod, but you also get to check the plan output for each environment before deploying code into that environment.

Drawbacks of this strategy:

Requires multiple environments

You have to deploy and maintain multiple environments, instead of just one.

Use cases where the strategy is typically a good fit:

All infrastructure deployments

The benefits of having both plan output before a deployment to an environment, and having pre-prod environments to test in before prod, are so significant, that once you get past a certain scale as a company, you should consider using this approach for all infrastructure deployments.

Now that you’ve seen all the basic deployment strategies, let’s turn our attention to how to implement these strategies as code in the form of deployment pipelines.

Deployment Pipelines

A deployment pipeline is the process you use to go from an idea to live code that affects your users. It consists of all the steps you must go through on the way to release. Deployment pipelines are different at every company, as they are effectively capturing your company’s processes, policies, and requirements as code, but most pipelines include the following:

Commit

How do you get code into version control? Do you use a pull-request based process? Do you use trunk-based development?

Build

What compilation and build steps do you need? How do you package the code?

Test

What automated tests do you run against the code? What manual tests?

Review

What review processes do you use? Who has to sign off and approve merges and deployments?

Deploy

How do you get the new code into production? How do you release new functionality to users?

Typically, you run a deployment pipeline on a deployment server, and not a developer’s computer (you’ll see later in this blog post why). The most common option is to use the same server you use for CI, such as the ones you saw earlier in the post (e.g., GitHub Actions, CircleCI, and GitLab). Another option is to use deployment servers that are designed for a specific technology: for example, for OpenTofu and Terraform, you might use the HashiCorp Cloud Platform, env0, Scalr, Spacelift, or Atlantis.

You also need to pick a language for defining your pipeline as code. Again, the most common option is to use the workflow definition language that comes with your CI server: e.g., GitHub Actions workflows are defined in YAML. Other options include defining workflows in scripting languages (e.g., Ruby, Python, Bash), in your build system’s language (e.g., NPM, Maven, Make), or, as a relatively recent option, in a tool designed for defining workflows that can run on a variety of platforms, such as Dagger or Common Workflow Language. In many cases, a deployment pipeline will use multiple languages and tools together.

The best way to understand deployment pipelines is to see an example, which is the focus of the next several sections. After that, you’ll learn about deployment pipeline best practices that apply to any company and any pipeline.

Example: configure an automated deployment pipeline in GitHub Actions

To avoid introducing too many new tools, let’s stick to using GitHub Actions as the deployment server and GitHub Actions YAML workflows as the primary language for defining the pipeline. The goal is to implement the pipeline shown in Figure 59 for the lambda-sample module:

Figure 59. The steps of a typical deployment pipeline (left) and the example technologies you’ll use (right)

Here’s how this pipeline works:

  1. Commit code to a branch in your VCS: The first step is to make some code changes in a branch.

  2. Open a pull request: Once the changes are ready to review, you open a PR.

  3. Run automations for open pull request: Your deployment server runs automations on the open PR, such as compiling the code, static analysis, functional tests (e.g., unit tests, integration tests, etc.), and generating the plan output by running tofu plan.

  4. Review and merge the pull request: Your team members review the PR, plus the outputs of the automations (e.g., test results, plan output), and if everything looks good, merge the PR in.

  5. Run automations for the merged pull request: Finally, your deployment server runs automations for the merged PR, such as compiling the code, static analysis, functional tests, and lastly, deploying the changes by running tofu apply.

This type of pipeline, where you mostly drive actions through operations in Git (e.g., commits, branches, and pull requests) is often referred to as a GitOps pipeline. As it turns out, you’ve implemented most of this GitOps pipeline in this blog post already, as part of setting up automated tests in the CI section. The only items missing are the following:

  • When you open a PR, run plan on the lambda-sample module.

  • When you merge a PR, run apply on the lambda-sample module.

To add these items, you need to do the following steps:

  • Use a remote backend for OpenTofu state

  • Add IAM roles for infrastructure deployments in GitHub Actions

  • Define a pipeline for infrastructure deployments

The following three sections will cover each of these steps.

Example: use a remote backend for OpenTofu state

In Section 2.5.2, you learned that, by default, OpenTofu uses the local backend to store OpenTofu state in .tfstate files on your local hard drive. This works fine when you’re learning and working alone, but if you want to use OpenTofu as a team, you need a way to share these state files. You might be tempted to use version control, but that’s not a good idea for the following reasons:

Manual error

It’s too easy to forget to pull down the latest changes from version control before running OpenTofu or to push your latest changes to version control after running OpenTofu. It’s just a matter of time before someone on your team runs OpenTofu with out-of-date state files and, as a result, accidentally rolls back or duplicates previous deployments.

Locking

Most version control systems do not provide any form of locking that would prevent two team members from running tofu apply on the same state file at the same time.

Secrets

All data in OpenTofu state files is stored in plain text. This is a problem because certain OpenTofu resources need to store sensitive data. For example, if you use the aws_db_instance resource to create a database, OpenTofu will store the username and password for the database in a state file in plain text, and you shouldn’t store plain text secrets in version control (something you’ll learn more about in Part 8 [coming soon]).

This is why in Part 4, you added .tfstate files to .gitignore, so as not to accidentally check them in.

Instead of using version control, the best way to share state files in a team is to use a supported remote backend, such as Amazon Simple Storage Service (S3), Azure Storage, Google Cloud Storage (GCS), Consul, or Postgres. Remote backends solve the three issues just listed:

Manual error

After you configure a remote backend, OpenTofu will automatically load the state file from that backend every time you run plan or apply, and it’ll automatically store the state file in that backend after each apply, so there’s no chance of manual error.

Locking

Most of the remote backends natively support locking. To run tofu apply, OpenTofu will automatically acquire a lock; if someone else is already running apply, they will already have the lock, and you will have to wait. You can run apply with the -lock-timeout=<TIME> parameter to instruct OpenTofu to wait up to TIME for a lock to be released (e.g., -lock-timeout=10m will wait for 10 minutes).

Secrets

Most of the remote backends natively support encryption in transit and encryption at rest of the state file. Moreover, those backends usually expose ways to configure access permissions, so you can control who has access to your state files and the secrets they might contain.

If you’re using OpenTofu with AWS, Amazon’s managed file store, S3, is typically your best bet as a remote backend for the following reasons:

  • It’s a managed service, so you don’t need to deploy and manage extra infrastructure to use it.

  • It’s designed for 99.999999999% durability and 99.99% availability, which means you don’t need to worry too much about data loss or outages.

  • It supports encryption, which reduces worries about storing sensitive data in state files.

  • It supports locking via DynamoDB (more on this shortly).

  • It supports versioning, so every revision of your state file is stored, and you can roll back to an older version if something goes wrong.

  • It’s inexpensive, with most OpenTofu usage easily fitting into the AWS Free Tier.

To enable remote state storage with Amazon S3, you must first create an S3 bucket and DynamoDB table. The blog post series’s sample code repo includes a module called state-bucket in the ch5/tofu/modules/state-bucket folder which can create an S3 bucket to store OpenTofu state, including:

  • Enabling versioning on the S3 bucket so that every update to a file in the bucket actually creates a new version of that file. This allows you to see older versions of the file and revert to those older versions at any time, which can be a useful fallback mechanism if something goes wrong.

  • Turning server-side encryption on by default for all data written to the S3 bucket. This ensures that your state files, and any secrets they might contain, are always encrypted on disk when stored in S3.

  • Blocking all public access to the S3 bucket. S3 buckets are private by default, but as they are often used to serve static content—e.g., images, fonts, CSS, JS, HTML—it is possible, even easy, to make the buckets public. Since your state files may contain sensitive data and secrets, it’s worth adding this extra layer of protection to ensure no one on your team can ever accidentally make this S3 bucket public.

The state-bucket module can also create a DynamoDB table for OpenTofu locking. DynamoDB is Amazon’s distributed key-value store. It supports strongly consistent reads and conditional writes, which are all the ingredients you need for a distributed lock system. Moreover, it’s completely managed, so you don’t have any infrastructure to run yourself, and it’s inexpensive, with most OpenTofu usage easily fitting into the AWS free tier.
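To see why conditional writes are enough to build a lock, consider this rough Python sketch. It uses an in-memory FakeTable class as a stand-in for a key-value store with conditional writes; all the names here are hypothetical, and this is an illustration of the general idea rather than how OpenTofu or DynamoDB are actually implemented. The timeout plays a role similar to the -lock-timeout parameter you saw earlier.

```python
import time


class ConditionalWriteFailed(Exception):
    """Raised when a write is rejected because its condition did not hold."""


class FakeTable:
    """In-memory stand-in for a key-value table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key: str, value: str) -> None:
        # The condition: only write if no item with this key exists yet.
        if key in self._items:
            raise ConditionalWriteFailed(key)
        self._items[key] = value

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


def acquire_lock(table: FakeTable, lock_id: str, owner: str,
                 timeout_seconds: float = 0.0, poll_seconds: float = 0.01) -> bool:
    """Try to take the lock, retrying until timeout_seconds has elapsed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            table.put_if_absent(lock_id, owner)
            return True  # our write succeeded, so we hold the lock
        except ConditionalWriteFailed:
            if time.monotonic() >= deadline:
                return False  # someone else still holds the lock
            time.sleep(poll_seconds)


def release_lock(table: FakeTable, lock_id: str) -> None:
    table.delete(lock_id)


table = FakeTable()
print(acquire_lock(table, "ch5/tofu/live/lambda-sample", owner="alice"))
print(acquire_lock(table, "ch5/tofu/live/lambda-sample", owner="bob"))
release_lock(table, "ch5/tofu/live/lambda-sample")
print(acquire_lock(table, "ch5/tofu/live/lambda-sample", owner="bob"))
```

Only one writer can ever succeed in creating the lock item, so two people running apply at the same time can't both proceed.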

To use the state-bucket module, first check out the main branch of your own repo, and make sure you have the latest code:

$ cd fundamentals-of-devops
$ git checkout main
$ git pull origin main

Next, create a new folder called tofu-state to use as a root module:

$ mkdir -p ch5/tofu/live/tofu-state
$ cd ch5/tofu/live/tofu-state

Within the tofu-state folder, create a main.tf file with the contents shown in Example 100:

Example 100. Configure the state-bucket module (ch5/tofu/live/tofu-state/main.tf)
provider "aws" {
  region = "us-east-2"
}

module "state" {
  source = "github.com/brikis98/devops-book//ch5/tofu/modules/state-bucket"

  # TODO: fill in your own bucket name!
  name = "fundamentals-of-devops-tofu-state"
}

This code sets just one parameter, name, which will be used as the name of the S3 bucket and DynamoDB table. Note that S3 bucket names must be globally unique among all AWS customers. Therefore, you must change the name parameter from "fundamentals-of-devops-tofu-state" (which I already created) to your own name. Make sure to remember this name and take note of what AWS region you’re using because you’ll need both pieces of information again a little later on.

To create the S3 bucket and DynamoDB table, run init and apply as usual:

$ tofu init
$ tofu apply

Once apply is done, you can start using the S3 bucket and DynamoDB table for state storage. To do that, you need to update your OpenTofu modules with a backend configuration. As a first step, add a backend.tf file to the tofu-state module with the contents shown in Example 101:

Example 101. OpenTofu code to use an S3 bucket and DynamoDB table as a backend (ch5/tofu/live/tofu-state/backend.tf)
terraform {
  backend "s3" {
    # TODO: fill in your own bucket name here!
    bucket         = "fundamentals-of-devops-tofu-state" # (1)
    key            = "ch5/tofu/live/tofu-state"          # (2)
    region         = "us-east-2"                         # (3)
    encrypt        = true                                # (4)
    # TODO: fill in your own DynamoDB table name here!
    dynamodb_table = "fundamentals-of-devops-tofu-state" # (5)
  }
}

Here’s what this code does:

(1) Configure the S3 bucket to use as a remote backend. Make sure to fill in your own S3 bucket’s name here.
(2) The filepath within the S3 bucket where the OpenTofu state file should be written. You can use a single S3 bucket and DynamoDB table to store the state file for many different modules so long as you ensure that each module gets a unique key (filepath) for its state file.
(3) The AWS region where you created your S3 bucket.
(4) Setting encrypt to true ensures that your OpenTofu state will be encrypted on disk when stored in S3. You already enabled default encryption in the S3 bucket itself, so this is here as a second layer to ensure that the data is always encrypted.
(5) The DynamoDB table to use for locking. Make sure to fill in your own DynamoDB table’s name here.

Run tofu init one more time, and you should see something like this:

$ tofu init

Initializing the backend...
Do you want to copy existing state to the new backend?
  Pre-existing state was found while migrating the previous "local" backend
  to the newly configured "s3" backend. No existing state was found in the
  newly configured "s3" backend. Do you want to copy this state to the new
  "s3" backend? Enter "yes" to copy and "no" to start with an empty state.

  Enter a value:

OpenTofu will automatically detect that you already have a state file locally and prompt you to copy it to the new S3 backend. If you type yes and hit ENTER, you should see the following:

Successfully configured the backend "s3"! OpenTofu will automatically
use this backend unless the backend configuration changes.

With this backend enabled, OpenTofu will automatically pull the latest state from this S3 bucket before running a command and automatically push the latest state to the S3 bucket after running a command, and it’ll use DynamoDB locks to handle concurrent access.

You should make the same change in the lambda-sample module as well, adding the backend.tf file shown in Example 102:

Example 102. Update the lambda-sample module to use S3 as a backend (ch5/tofu/live/lambda-sample/backend.tf)
terraform {
  backend "s3" {
    # TODO: fill in your own bucket name here!
    bucket         = "fundamentals-of-devops-tofu-state" # (1)
    key            = "ch5/tofu/live/lambda-sample"       # (2)
    region         = "us-east-2"
    encrypt        = true
    # TODO: fill in your own DynamoDB table name here!
    dynamodb_table = "fundamentals-of-devops-tofu-state" # (3)
  }
}

This is identical to the backend.tf in the tofu-state module, but note three things:

(1) Just as in the tofu-state module, you’ll need to fill in the name of your own S3 bucket here.
(2) The key value for the lambda-sample module must be different than the tofu-state module, so they don’t overwrite each other’s state!
(3) Just as in the tofu-state module, you’ll need to fill in the name of your own DynamoDB table here.
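Since a duplicated key is an easy copy/paste mistake to make, you might add a small check to your CI. The Python sketch below is one hypothetical way to do it: it takes the contents of each module's backend.tf, pulls out the key with a simple regular expression (an assumption that works for backend blocks formatted like the ones above, not a full HCL parser), and reports any key used by more than one module. It also assumes all the modules share one bucket.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches lines such as:   key = "ch5/tofu/live/lambda-sample"
KEY_PATTERN = re.compile(r'^\s*key\s*=\s*"([^"]+)"', re.MULTILINE)


def duplicate_backend_keys(backend_files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each backend key used by more than one module to those modules."""
    modules_by_key = defaultdict(list)
    for module, contents in backend_files.items():
        match = KEY_PATTERN.search(contents)
        if match:
            modules_by_key[match.group(1)].append(module)
    return {key: sorted(modules)
            for key, modules in modules_by_key.items() if len(modules) > 1}


# Hypothetical backend.tf contents, with a copy/paste mistake in lambda-sample.
files = {
    "tofu-state": 'terraform {\n  backend "s3" {\n    key = "ch5/tofu/live/tofu-state"\n  }\n}\n',
    "lambda-sample": 'terraform {\n  backend "s3" {\n    key = "ch5/tofu/live/tofu-state"\n  }\n}\n',
}
print(duplicate_backend_keys(files))
```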

Get your hands dirty

If you’re like me, you’re probably annoyed by all the copy/paste you need to do with these backend configurations. Unfortunately, OpenTofu does not support using variables or any other kind of logic in backend blocks, so some amount of copy/paste is necessary. However, you can try out one of the following approaches to significantly reduce the code duplication:

To finish up the remote state setup, do the following two steps:

  1. Run init on the lambda-sample module to set up remote state storage, just as you did with the tofu-state module.

  2. Commit your changes to the lambda-sample and tofu-state modules and push them to main.

Now that you have a remote backend set up, you can move on to the next step, which is setting up IAM roles that will allow you to do deployments from GitHub Actions.

Example: add IAM roles for infrastructure deployments in GitHub Actions

Earlier in this blog post, you configured an OIDC provider to give GitHub Actions access to your AWS account for running automated tests. Now you need a way to give GitHub Actions access to your AWS account for deployments. Normally, you would deploy to a totally separate environment (separate AWS account) from where you run automated tests, so you’d need to configure a new OIDC provider in your deployment environment. However, to keep things simple in this post, let’s use the same AWS account for both deployment and testing (you’ll learn how to set up additional environments in Part 6 [coming soon]). That allows you to use the same OIDC provider; however, you still need to create new IAM roles for the following reasons:

  • The permissions you need for automated tests are different than those for deployment.

  • The permissions for deployment should be managed via two separate IAM roles: one for plan and one for apply. That’s because you want plan to run on any branch before a PR has merged, whereas you want apply only to run on main after a PR has merged. Since the plan portion runs before merge—before anyone has had a chance to review the code changes—the IAM role you use for plan should be limited to read-only permissions: enough to see the plan output, but not enough to make any changes.

Open up main.tf in the ci-cd-permissions module and add the code shown in Example 103 to enable creating IAM roles for both plan and apply:

Example 103. Update the ci-cd-permissions module to enable IAM roles for plan and apply (ch5/tofu/live/ci-cd-permissions/main.tf)


module "iam_roles" {

  # ... (other params omitted) ...

  enable_iam_role_for_plan  = true                                # (1)
  enable_iam_role_for_apply = true                                # (2)

  # TODO: fill in your own bucket and table name here!
  tofu_state_bucket         = "fundamentals-of-devops-tofu-state" # (3)
  tofu_state_dynamodb_table = "fundamentals-of-devops-tofu-state" # (4)
}

This code does the following:

(1) Enable the IAM role for plan. This IAM role will get read-only permissions. The OIDC provider will be allowed to assume this role from any branch.
(2) Enable the IAM role for apply. This IAM role will get both read and write permissions. The OIDC provider will only be allowed to assume this role from the main branch. This ensures that only merged PRs can be deployed.
(3) Configure which S3 bucket to use for Tofu state. Make sure to fill in your own S3 bucket’s name here. The plan role will get read-only access to this bucket; the apply role will get read and write access.
(4) Configure which DynamoDB table to use for Tofu state. Make sure to fill in your own DynamoDB table’s name here. The plan role will get read-only access to this table; the apply role will get read and write access.
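To get a feel for how "any branch" versus "only main" can be enforced, it helps to know that the OIDC tokens GitHub issues include a sub (subject) claim, which looks like repo:ORG/REPO:ref:refs/heads/BRANCH for a push to a branch and repo:ORG/REPO:pull_request for a pull request. An IAM role's trust policy can then match that claim against a pattern. The Python sketch below simulates that matching with wildcard patterns. The repo name and the two pattern lists are assumptions for illustration; I'm not reproducing the actual policy in the gh-actions-iam-roles module, so read its code to see what it really does.

```python
from fnmatch import fnmatchcase
from typing import List

REPO = "example-org/example-repo"  # hypothetical repo name

# Assumed patterns: plan may run from anywhere in the repo, apply only from main.
PLAN_PATTERNS = [f"repo:{REPO}:*"]
APPLY_PATTERNS = [f"repo:{REPO}:ref:refs/heads/main"]


def can_assume(sub_claim: str, allowed_patterns: List[str]) -> bool:
    """Return True if the token's sub claim matches any allowed pattern."""
    return any(fnmatchcase(sub_claim, pattern) for pattern in allowed_patterns)


pull_request_sub = f"repo:{REPO}:pull_request"
main_branch_sub = f"repo:{REPO}:ref:refs/heads/main"

print(can_assume(pull_request_sub, PLAN_PATTERNS))   # a PR can use the plan role
print(can_assume(pull_request_sub, APPLY_PATTERNS))  # but not the apply role
print(can_assume(main_branch_sub, APPLY_PATTERNS))   # main can use the apply role
```

This is why the split into two roles matters: code in an unreviewed PR can, at most, read enough to produce plan output.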

Next, update outputs.tf with two new output variables that contain the ARNs of the two new IAM roles, as shown in Example 104:

Example 104. Add output variables for the two new IAM roles (ch5/tofu/live/ci-cd-permissions/outputs.tf)
output "lambda_deploy_plan_role_arn" {
  value = module.iam_roles.lambda_deploy_plan_role_arn
}

output "lambda_deploy_apply_role_arn" {
  value = module.iam_roles.lambda_deploy_apply_role_arn
}

Run apply to create the new IAM roles and take note of the lambda_deploy_plan_role_arn and lambda_deploy_apply_role_arn outputs; you’ll need them shortly!

Commit your changes to the ci-cd-permissions module and push them to main. You’re now finally ready to define the deployment pipeline itself!

Get your hands dirty

Here are a couple of exercises you can try at home to get a better feel for IAM roles:

  • Open up the code for the gh-actions-iam-roles module and read through it. What permissions, exactly, is the module granting to those IAM roles? Why?

  • Create your own version of the gh-actions-iam-roles module that you can use for deploying other types of infrastructure, and not just Lambda functions: e.g., try to create IAM roles for deploying EKS clusters, EC2 instances, and so on.

Example: define a pipeline for infrastructure deployments

With all the prerequisites out of the way, you can finally implement a deployment pipeline for the lambda-sample module that will do the following:

  • When you open a PR, run plan on the lambda-sample module.

  • When you merge a PR, run apply on the lambda-sample module.

Watch out for snakes: this is a very simplified pipeline

The pipeline described here represents only a small piece of a real-world deployment pipeline.[31] It’s missing several important aspects, including:

App build

The lambda-sample module contains a source folder with a dirt-simple "Hello, World" app that doesn’t have any dependencies, tests, etc. In real-world deployment pipelines, you usually need to include steps to build the app, including compiling the code, running tests, packaging the app (e.g., as a Docker image), packaging static assets (e.g., minification, fingerprinting, etc.) and so on.

Multiple environments

In this blog post, you’ll be deploying the lambda-sample module into a single test account. In real-world usage, your deployment pipeline usually needs to support multiple environments (e.g., dev, stage, prod), and the ability to promote changes from one environment to the next. You’ll learn more about this in Part 6 [coming soon].

Security

The simplified example in this post doesn’t do much in terms of locking down the deployment pipeline. In real-world usage, you’d want to configure approval workflows, access controls (especially on who can edit workflows), and a number of other checks to secure your pipeline.

Let’s first create a workflow for the plan portion. Create a new file called .github/workflows/tofu-plan.yml with the contents shown in Example 105:

Example 105. The workflow to run tofu plan (.github/workflows/tofu-plan.yml)
name: Tofu Plan

on:
  pull_request: # (1)
    branches: ["main"]
    paths: ["ch5/tofu/live/lambda-sample/**"]

jobs:
  plan:
    name: "Tofu Plan"
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write # (2)
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v2

      - uses: aws-actions/configure-aws-credentials@v3
        with:
          # TODO: fill in your IAM role ARN!
          role-to-assume: arn:aws:iam::111111111111:role/lambda-sample-plan # (3)
          role-session-name: plan-${{ github.run_number }}-${{ github.actor }}
          aws-region: us-east-2

      - uses: opentofu/setup-opentofu@v1

      - name: tofu plan # (4)
        id: plan
        working-directory: ch5/tofu/live/lambda-sample
        run: |
          tofu init -no-color -input=false
          tofu plan -no-color -input=false -lock=false

      - uses: peter-evans/create-or-update-comment@v4 # (5)
        if: always()
        env:
          RESULT_EMOJI: ${{ steps.plan.outcome == 'success' && '✅' || '⚠️' }}
        with:
          issue-number: ${{ github.event.pull_request.number }}
          body: |
            ## ${{ env.RESULT_EMOJI }} `tofu plan` output
            ```${{ steps.plan.outputs.stdout }}```

This workflow has a few things you haven’t seen before:

(1) Instead of running on push, this workflow runs on pull requests: more specifically, only on pull requests against the main branch that have modifications to the ch5/tofu/live/lambda-sample folder. In a real-world pipeline, you may want to expand this to all modules in the tofu folder: e.g., ch5/tofu/live/**.
(2) Add the pull-requests: write permission so in (5), the workflow can post a comment on your pull request.
(3) Assume the plan IAM role. Make sure to fill in your own IAM role ARN here from the lambda_deploy_plan_role_arn output variable you got in the last section.
(4) Run tofu init and tofu plan, passing a few flags to ensure the commands run well in a CI environment: i.e., disable terminal colors and interactive prompts. There’s also a flag to disable locking, as you don’t need that for plan.
(5) Use an open source GitHub Action to post a comment on the pull request that contains the plan output. The comment is formatted in Markdown, which GitHub natively supports, and includes not only the plan output, but also a ✅ or ⚠️ emoji to help you see at a glance if the plan command ran successfully or exited with an error, respectively. This allows your team members to review the code and plan output all in one place.

Next, create a workflow for the apply portion in a new file called .github/workflows/tofu-apply.yml, with the contents shown in Example 106:

Example 106. The workflow to run tofu apply (.github/workflows/tofu-apply.yml)
name: Tofu Apply

on:
  push: # (1)
    branches: ["main"]
    paths: ["ch5/tofu/live/lambda-sample/**"]

jobs:
  apply:
    name: "Tofu Apply"
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v2

      - uses: aws-actions/configure-aws-credentials@v3
        with:
          # TODO: fill in your IAM role ARN!
          role-to-assume: arn:aws:iam::111111111111:role/lambda-sample-apply # (2)
          role-session-name: apply-${{ github.run_number }}-${{ github.actor }}
          aws-region: us-east-2

      - uses: opentofu/setup-opentofu@v1

      - name: tofu apply # (3)
        id: apply
        working-directory: ch5/tofu/live/lambda-sample
        run: |
          tofu init -no-color -input=false
          tofu apply -no-color -input=false -lock-timeout=60m -auto-approve

      - uses: jwalton/gh-find-current-pr@master # (4)
        id: find_pr
        with:
          state: all

      - uses: peter-evans/create-or-update-comment@v4 # (5)
        if: steps.find_pr.outputs.number
        env:
          RESULT_EMOJI: ${{ steps.apply.outcome == 'success' && '✅' || '⚠️' }}
        with:
          issue-number: ${{ steps.find_pr.outputs.number }}
          body: |
            ## ${{ env.RESULT_EMOJI }} `tofu apply` output
            ```${{ steps.apply.outputs.stdout }}```

This workflow is similar to the one for plan, but with a few key differences:

(1) Run only on pushes to the main branch that modify the ch5/tofu/live/lambda-sample folder.
(2) Assume the apply IAM role. Make sure to fill in your own IAM role ARN here from the lambda_deploy_apply_role_arn output variable you got in the last section.
(3) Run tofu init and tofu apply, again passing a few flags to ensure the commands run well in a CI environment. Note also the use of -lock-timeout=60m to ensure this command will wait up to 60 minutes if someone else has a lock (e.g., a concurrent apply being run by a previous merge).
(4) If this push came from a pull request, use an open source GitHub Action to find the ID of that pull request so that you can add the output of apply as a comment in (5).
(5) If the previous step found a pull request ID, post a comment to the pull request with the apply output. Again, this includes the ✅ or ⚠️ emoji to quickly let you know if apply succeeded, as well as the log output from apply in case you need to debug a problem.

Commit these new workflow files directly to the main branch and then push them to GitHub:

$ git add .github/workflows
$ git commit -m "Add plan and apply workflows"
$ git push origin main

Now, let’s give this deployment pipeline a shot. First, create a new branch called deployment-pipeline-test:

$ git checkout -b deployment-pipeline-test

Make a change to the lambda-sample module, such as changing the text it returns, as shown in Example 107:

Example 107. Update the Lambda function response text (ch5/tofu/live/lambda-sample/src/index.js)
exports.handler = (event, context, callback) => {
  callback(null, {statusCode: 200, body: "Fundamentals of DevOps!"});
};

And make sure to similarly update the assertion in the automated test in deploy.tftest.hcl, as shown in Example 108:

Example 108. Update the Lambda module tests (ch5/tofu/live/lambda-sample/deploy.tftest.hcl)
  assert {
    condition     = data.http.test_endpoint.response_body == "Fundamentals of DevOps!"
    error_message = "Unexpected body: ${data.http.test_endpoint.response_body}"
  }

Commit both of these changes, push them to the deployment-pipeline-test branch, open a pull request, and you should see a page that looks like Figure 60:

Figure 60. The deployment pipeline running in a GitHub PR

You should see four things running in your pipeline:

  • Automated tests for the sample app.

  • Terrascan for your infrastructure code.

  • Tofu test for your infrastructure code.

  • tofu plan on the lambda-sample module.

When everything has finished, the PR should automatically be updated with a comment that shows the plan output, as shown in Figure 61:

Figure 61. After opening the PR, the deployment pipeline will add a comment with the plan output

Now you can review the code changes and plan output, and if everything looks good, merge the PR. This will kick off the apply workflow, and after a minute or two, it should post a comment with the apply output, as shown in Figure 62:

Figure 62. After merging the PR, the deployment pipeline will add a comment with the apply output

Congrats, you now have a basic deployment pipeline in place for your lambda-sample module! It runs tests, it runs plan, and it runs apply.

Get your hands dirty

Here are a few exercises you can try at home to get a better feel for deployment pipelines:

  • Update the pipeline to automatically detect changes in any folder with OpenTofu code (rather than only the lambda-sample folder), and to automatically run plan and apply in each one. The open source changed-files action can be helpful here.

  • If a pull request updates multiple folders with OpenTofu code, have the pipeline run plan and apply across multiple folders concurrently by using a matrix strategy.
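As a starting point for these exercises, the core logic is mapping a list of changed file paths to the set of module folders that need plan and apply. Here is a minimal Node.js sketch of that mapping. It assumes every module lives one level below a known root folder, and the changedModuleFolders function, the another-module folder, and the sample paths are all hypothetical; how you obtain the list of changed files (e.g., from the changed-files action) is left to you.

```javascript
// Hypothetical helper: map changed file paths to the unique module folders
// under a "live" root. Assumed layout: <liveRoot>/<module>/...
function changedModuleFolders(changedFiles, liveRoot = "ch5/tofu/live") {
  const prefix = liveRoot.replace(/\/+$/, "") + "/";
  const folders = new Set();
  for (const file of changedFiles) {
    if (!file.startsWith(prefix)) continue; // outside the live root
    const rest = file.slice(prefix.length);
    const slash = rest.indexOf("/");
    if (slash === -1) continue; // a file directly in the root, not in a module
    folders.add(prefix + rest.slice(0, slash));
  }
  return [...folders].sort();
}

// Made-up example input: paths as a changed-files step might report them.
const folders = changedModuleFolders([
  "ch5/tofu/live/lambda-sample/src/index.js",
  "ch5/tofu/live/lambda-sample/main.tf",
  "ch5/tofu/live/another-module/main.tf",
  "ch5/sample-app/app.js",
]);

// Prints: ["ch5/tofu/live/another-module","ch5/tofu/live/lambda-sample"]
console.log(JSON.stringify(folders));
```

A JSON array like this is the kind of value a matrix strategy can iterate over: one option is to expose it as a job output and read it in a later job with fromJSON, so that each folder gets its own plan or apply job.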

Deployment pipeline best practices

Now that you’ve seen the basics of deployment pipelines, and an example of how to implement one, let’s go through the best practices for deployment pipelines:

Automate all the steps that can be automated

Every deployment pipeline includes steps that must be done by humans, such as writing code, reviewing code, and perhaps manual testing and verification. All the other steps should be completely automated. Remember, it’s only continuous delivery if it is fast, reliable, and sustainable. These three things are precisely where computers excel over humans: whereas humans are slow at performing a bunch of steps, computers can run automated processes extremely quickly; whereas humans make mistakes all the time while running manual processes, computers carry out automated processes in a way that is predictable and repeatable; and whereas humans can get frustrated from repeating the same steps over and over again (especially if those steps sometimes cause an outage), computers never get tired or stressed.

Deploy only from a deployment server

Not only should most of your deployment pipeline be automated, but all of that automation should typically run only on a dedicated deployment server, not on any developer’s computer. In most cases, the deployment server is your CI server (e.g., GitHub Actions, GitLab, Jenkins), but for some types of pipelines, you may have separate, dedicated deployment tools (e.g., ArgoCD in a Kubernetes cluster, HashiCorp Cloud Platform, Atlantis). Here’s why:

Full automation

One of the benefits of forcing the deployment pipeline to run entirely on a deployment server is that it forces you to fully automate everything that can be automated. There is a surprisingly big gap between a pipeline that is mostly automated, but still requires a few manual steps here and there, and one that is fully automated: if your pipeline relies on even a few manual steps, it can dramatically reduce your ability to deliver software.

Something magical happens when you get to full automation: it’s only when the whole pipeline runs at the push of a single button that you get a CD pipeline that is fast, reliable, and sustainable; that you have environments that are truly reproducible; that you can achieve world-class results like the companies mentioned in Section 1.1, who are able to deploy thousands of times per day.

Repeatability

If developers run deployments from their own computers, you’ll run into problems due to differences in how their computers are configured: for example, different operating systems, different dependency versions (e.g., different versions of OpenTofu or Node.js installed locally), different configurations, and differences in what’s actually being deployed (e.g., the developer accidentally deploys a change that wasn’t committed to version control). You can eliminate all of these issues by deploying everything from a dedicated deployment server that provides a consistent, repeatable environment.

Permissions management

Instead of giving developers permissions to deploy, you can give those permissions solely to the deployment server (especially for the production environment). It’s easier to enforce good security practices for a single server than for numerous developers with production access.

Protect the deployment server

To be able to do automated deployments from a server, you have to give the server access to sensitive permissions, such as AWS credentials. In fact, to deploy arbitrary infrastructure changes (e.g., to be able to run tofu apply on arbitrary OpenTofu modules), you need arbitrary permissions, which is just a fun way of saying "admin permissions." So deployment servers are a terrifying combination: they (a) have access to powerful, sensitive permissions, (b) are accessible to every developer in your company, and (c) are designed to execute arbitrary code. This is why deployment servers are particularly tempting targets for malicious actors.

Here are a few things you can do to protect your deployment server:

Lock down your deployment server

Make it accessible solely over HTTPS, require all users to be authenticated, ensure all actions are logged, and so on. If possible, don’t even allow the deployment server to be accessed over the public Internet: e.g., lock it down so you can only access it from your company’s offices or over a VPN connection. You’ll learn more about networking in Part 7 [coming soon] and security functionality such as HTTPS and authentication in Part 8 [coming soon].

Lock down your version control system

Since deployment servers typically execute workflows and code in your version control system, if an attacker can slip malicious code into one of your repos, they can bypass most other protections. Therefore, it’s critical that you protect your VCS, as described in Section 4.1.5.5.

Enforce an approval workflow

Configure your deployment pipeline to require that every deployment is approved by at least one person other than the person who requested the deployment in the first place. This ensures that if one developer account is compromised, you always have a second set of eyes to hopefully catch malicious code before it can take effect.

Limit permissions before approval/merge

A common workflow in an infrastructure deployment pipeline is to run plan when you open a PR and to run apply after the PR has been approved and merged. One thing you have to be careful about is the permissions you grant to the plan and apply steps: the plan step should only have read permissions, whereas the apply step should have both read and write permissions. If you use the same set of permissions for both, then a malicious actor could open a PR that immediately uses the write permissions to make whatever changes they want, bypassing all approval workflows.

Don’t give the deployment server long-lived credentials

Whenever possible, use automatically-managed, short-lived credentials (e.g., OIDC) instead of manually-managed, long-lived credentials (e.g., machine user access keys). That way, if a malicious actor does manage to get access to those credentials, there is a short time window during which they can use them, and then they expire.

Limit the permissions of each pipeline

Instead of a single deployment pipeline that deploys arbitrary code, and therefore needs arbitrary (admin) permissions, create multiple pipelines, each of which is designed for specific tasks. You might partition your pipelines based on the type of task (e.g., deploy Kubernetes apps, databases, networking) or by team (e.g., search team, analytics team, networking team), and the idea is to grant each pipeline a limited set of permissions it needs for that set of tasks. You can also restrict access to each pipeline so only the developers who need to use it have access to it. This limits the damage an attacker can do from compromising any single developer account or pipeline.

Limit what the pipeline can do with its permissions

In addition to limiting the permissions you grant to each pipeline, you can also limit what developers can do with those permissions. For example, you might not want developers to be able to execute arbitrary code in pipelines that have access to powerful permissions, so you might put in checks that developers can only run specific commands (e.g., tofu apply), on code from specific repos (e.g., repos where your team keeps its OpenTofu modules), on specific branches (e.g., main), and in specific folders (e.g., search team members can only make changes in the search folder). You should also lock down the workflow definitions themselves, so only a trusted set of admins can update them, and only with PR approval from at least one other admin.[32]
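The checks described in the last item (specific commands, repos, branches, and folders) can be sketched as a simple allow-list. This is a minimal Node.js illustration, not a feature of any particular CI system: the policy object, the checkRequest function, and the repo and folder names are all hypothetical, and a real pipeline would need to enforce something like this in a place developers cannot edit.

```javascript
// Hypothetical policy: the only commands, repos, branches, and folders
// this pipeline is allowed to act on.
const policy = {
  commands: ["tofu plan", "tofu apply"],
  repos: ["example-org/infrastructure-live"],
  branches: ["main"],
  folders: ["search"],
};

// Return a list of policy violations; an empty list means the request is allowed.
function checkRequest(request, policy) {
  const violations = [];
  if (!policy.commands.includes(request.command)) {
    violations.push(`command not allowed: ${request.command}`);
  }
  if (!policy.repos.includes(request.repo)) {
    violations.push(`repo not allowed: ${request.repo}`);
  }
  if (!policy.branches.includes(request.branch)) {
    violations.push(`branch not allowed: ${request.branch}`);
  }
  // Reject ".." segments so "search/../billing" cannot sneak past the prefix check.
  const segments = request.folder.split("/");
  const inAllowedFolder =
    !segments.includes("..") &&
    policy.folders.some(
      (folder) => request.folder === folder || request.folder.startsWith(folder + "/")
    );
  if (!inAllowedFolder) {
    violations.push(`folder not allowed: ${request.folder}`);
  }
  return violations;
}

// An allowed request produces no violations.
console.log(checkRequest({
  command: "tofu apply",
  repo: "example-org/infrastructure-live",
  branch: "main",
  folder: "search/indexer",
}, policy)); // []

// A request with a disallowed command, branch, and folder is rejected.
console.log(checkRequest({
  command: "bash deploy.sh",
  repo: "example-org/infrastructure-live",
  branch: "feature-x",
  folder: "search/../billing",
}, policy));
```

Returning the full list of violations, rather than failing on the first one, makes it easier to log exactly why a request was rejected.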

Setting up a deployment pipeline the right way is a lot of work: this shouldn’t be too surprising, as what you’re effectively trying to do is to capture your company’s processes, rules, and culture in the form of Bash scripts and YAML workflow files. That isn’t easy. But it’s worth it. Think of it this way: your infrastructure code (which you learned about in Part 2) and your CI/CD pipeline (which you learned about in this blog post) are essentially your company’s custom API for shipping software. Get the API right, and as you saw in Part 1, you can accelerate your company by a factor of 10x, 100x, or more.

Conclusion

In this blog post and the previous one, you made great strides in automating your entire SDLC through the use of CI/CD, allowing your team to work and collaborate in line with the 5 key takeaways from this blog post:

  • Ensure all developers merge all their work together on a regular basis: typically daily or multiple times per day.

  • Use a self-testing build after every commit to ensure your code is always in a working and deployable state.

  • Use branch by abstraction and feature toggles to make large-scale changes while still merging your work on a regular basis.

  • Use machine user credentials or automatically-provisioned credentials to authenticate from a CI server or other automations.

  • Ensure you can deploy to production at any time in a manner that is fast, reliable, and sustainable.

Virtually every company that has a world-class software delivery process, such as the ones you heard about in Section 1.1, relies heavily on CI/CD to allow them to go fast. This is one of the surprising realizations of real-world systems: agility requires safety. With cars, speed limits are determined not by the limits of engines (most cars can easily go over 100 mph) but by safety: the safety mechanisms we have today, such as brakes, bumpers, and seat belts, are just not sufficient to protect you at speeds significantly over 60 mph.

The same is true with software delivery: the limit of how fast you can build software is usually not determined by how fast a developer can build new features, but by how quickly you can get those features to your users without causing bugs, outages, security incidents, and other problems. That’s why CI/CD is all about putting safety mechanisms in place, such as automated tests, code reviews, and feature toggles, so that you can release software faster without putting your product and users at risk. The more you can limit the risk—the safer you can make it for developers to release features—the faster you can go.

As your company grows, you’re going to start hitting new bottlenecks that limit your ability to go fast. Some of these bottlenecks will be from forces outside your company: more users, more load, more money, more requirements. Some of these bottlenecks will be from forces within your company: more products, more teams, more developers. To be able to handle these new demands, you will need to learn how to work with multiple environments and multiple teams, which is the focus of Part 6, How to Work with Multiple Teams and Environments [coming soon].

