Wrzasq.pl

Blue-green deployments with Terraform

Thursday, 24 December 2020, 13:29

Blue-green deployments are not always easy, but cloud environments and the tools that manage them make them much easier. One of the leading DevOps tools for managing resources in an infrastructure-as-code manner is Terraform. It is brilliant and at the same time easy to use. Still, when working exclusively with AWS, I use CloudFormation, as I consider it the better choice (for various reasons), and when it comes to serverless architecture I can hardly find anything simpler. This time, however, I will use Terraform, and the solution, even though it uses AWS resources as examples, should be portable to any infrastructure provider you would like to use.

In most guides and articles I've found on the web, infrastructure code was used to orchestrate the entire landscape as one codebase - meaning you would need to modify your code to reflect the current stage of the deployment process. This doesn't play well with another aspect of DevOps, automation, which is usually done via a CI/CD pipeline. I tried to describe blue-green deployment more as a process, which allows you to decouple each step and place it within your pipeline.

What is blue-green deployment?

According to Wikipedia, blue-green deployment is a method of installing changes to a web, app, or database server by swapping alternating production and staging servers. The purpose is to isolate changes from the running environment, so that we don't break what is running if the new version introduces a regression, worse performance, or any other problem. Usually it's achieved by deploying the new version (called "green") in parallel to the running version (called "blue") and pointing your users to one of them at different stages of the deployment process. Having both versions deployed at the same time gives you the old version as a fallback in case the updated version turns out to be broken.

Be warned - this is not the full story

What I will explain is the main coupling part between the environments. If you want to introduce a blue-green deployment strategy for your production-grade systems, you will need to fill in the gaps - mainly the tests (without them a blue-green switch doesn't make much sense). The goal is to switch to the new version only when you are sure it will not cause your users headaches (or at least no more than the old version does).

Another important aspect is data migration. I'm explaining only how to manage resources (particularly with Terraform), which corresponds to the stateless aspect of the application. While I could present some approaches to testing, data migration is even less generic: it depends on how you manage your data stores (as part of your application deployment, or separately), what data storage you use (RDBMS, NoSQL, files, black boxes behind service APIs), whether your data schema changes between deployments, whether you have a schema at all… Personally I think data migration is the trickiest aspect and there is no silver bullet for it, so I will cheat a little and simply skip it. At least for today.

More than two colors

The blue-green strategy focuses on just two versions: blue - the current one (soon to be old) and green - the next one (soon to be current).

[Figure: Blue-green deployment lifecycle]

In most cases, however, the technical solutions and tools that allow you to diversify and isolate versions do not limit you to just two, and the approach described in this article is no exception. From a purely technical point of view we operate on multiple parallel deployment environments (not limited to two), plus a separate stack that accesses the resources of those environments. If you have such a requirement, you can even deploy multiple parallel versions with just a few changes, to allow A/B/N testing.

Workspaces

In order to work with blue-green deployments you need to be able to maintain multiple environments of your application in parallel. Terraform enables you to do so with its awesome workspaces feature. Workspaces allow you to manage multiple states of the same stack in isolated storage - in other words, Terraform switches between them without any of them affecting the others.

Here comes one catch: Terraform workspaces persist state, but Terraform itself always operates on the "current" code. So if you try to perform an in-place update in any workspace, Terraform will apply changes from whatever version of your infrastructure code is checked out at that moment. If you really need to make an in-place change, like a hotfix, make sure that you select the workspace that corresponds to the environment you want to change and that your code is in the state that was used to deploy that environment (for example the proper Git branch). However, blue-green deployment very often comes with the concept of immutable deployments - you deploy the resources only once, and if you need to change them, you deploy new resources to replace the previous version. This means the scenario described here should not happen, or should at least be rare.
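A small guard can make the hotfix case above less error-prone. The sketch below is only an illustration: the function name require_workspace is hypothetical, and it assumes you call it from the stack's directory. It relies on terraform workspace show, which prints the name of the selected workspace.

```shell
#!/bin/bash
# Hypothetical helper: fail unless the selected Terraform workspace
# is the one we expect to modify.
require_workspace() {
  local expected="$1"
  local actual
  actual="$(terraform workspace show)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "selected workspace is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}
```

For example, require_workspace v1 && terraform apply would refuse to apply when a different workspace is selected. Checking that the right Git branch is checked out is left out, since it depends on how you map branches to environments.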

Fine, knowing all of this, let's prepare a simple application deployment. In our example we will just deploy an EC2 instance with Nginx serving a static page that displays the current version:

terraform {
  required_version = ">= 0.13"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "3.14.1"
    }
  }

  backend "s3" {
    # your S3 storage configuration
  }
}

provider "aws" {
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "nginx" {
  ami                         = data.aws_ami.ubuntu.id
  instance_type               = "t3.micro"
  associate_public_ip_address = true
  # for content of `install.sh` just read further
  user_data                   = templatefile("${path.module}/install.sh", {
    version = terraform.workspace
  })
}

output "public_ip" {
  value = aws_instance.nginx.public_ip
}

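Our install.sh script, which will be executed on instance startup, just needs to install Nginx and write the page. Terraform renders the script through templatefile, so ${version} is replaced with the workspace name before the script reaches the instance: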

#!/bin/bash
apt-get update -y
apt-get upgrade -y
apt-get install nginx -y
# ${version} is filled in by templatefile with the workspace name (e.g. v2)
echo '<h1>${version}</h1>' > /var/www/html/index.html

Wrapper stack

Deploying multiple environments doesn't automatically make your approach a blue-green deployment - the key part of this strategy is being able to quickly switch between environments and to maintain all of them during the transition stage. At the same time, the change should be transparent to end users. In the application stack above we are able to spawn multiple environments by using separate workspaces - each deployment environment will have its own instance, with its own IP address. Our main goal is to hide them behind one edge endpoint that forwards traffic from the public network to the selected environment. For that we will make a separate stack (and we will call it "wrapper", because I thought it's a good name, so you don't have to invent one). For simplicity I will just use Route 53 for DNS, but you can do the same using a load balancer, an API gateway or other features/services if you wish.

The wrapper is obviously also managed by Terraform. Here is the key coupling part: thanks to the terraform_remote_state data source we can read the outputs of other stacks, so we can point our wrapper at the desired workspace, read its outputs and use them to populate the public endpoint settings:

# read current public version info

variable "current" {
  type = string
}

data "terraform_remote_state" "deployment" {
  backend   = "s3"
  workspace = var.current

  config = {
    # your S3 backend configuration - it needs to correspond with your application setup from the previous paragraph
  }
}

# point Route53 to the current EC2 instance

resource "aws_route53_zone" "domain" {
  name = "your-domain.com"
}

resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.domain.zone_id
  name    = "www.your-domain.com"
  type    = "A"
  ttl     = 300
  # the application stack exposes the address as the "public_ip" output
  records = [data.terraform_remote_state.deployment.outputs.public_ip]
}

If you would like to read the state outputs of multiple workspaces (e.g. to do a gradual switch with weighted routing), you can do this using a set of available deployments:

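# names of all deployed workspaces, e.g. ["v1", "v2"]
# (for_each needs a set or a map, not a plain list)
variable "environments" {
  type = set(string)
}

data "terraform_remote_state" "deployments" {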
  for_each  = var.environments
  backend   = "s3"
  workspace = each.key

  config = {
    # your S3 backend configuration
  }
}

# then to access outputs of one of the environments you can use:
output "public_ip" {
  value = data.terraform_remote_state.deployments[var.current].outputs.public_ip
}

Automating actions

Deploying new stack

Enough theory, let's see how it looks in practice - it's much simpler than you might think (in its basic form, of course). The first thing we will do is deploy a new version of our application, and we will do so by creating a new workspace to isolate its resources:

[Figure: Terraform blue-green - deploying green]

cd app-stack
terraform workspace new v2
terraform apply
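In a pipeline the same job may run more than once for the same version, and terraform workspace new fails when the workspace already exists. A possible way around that, sketched here with a hypothetical ensure_workspace helper, is to try selecting the workspace first and create it only when that fails:

```shell
#!/bin/bash
# Hypothetical helper: select the workspace if it already exists,
# otherwise create it (creating a workspace also selects it).
ensure_workspace() {
  local name="$1"
  terraform workspace select "$name" 2>/dev/null || terraform workspace new "$name"
}
```

With it, the deployment step could become cd app-stack, ensure_workspace v2, terraform apply.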

Now that we have the new version deployed, it's time to test it and check that everything works as we want. Our public endpoint is still pointing to the blue deployment, so we can run all of these tests before risking failures with real users.
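What such a test looks like depends entirely on your application. For the Nginx example, a very rough smoke test could request the page directly from the green instance and check that it mentions the workspace name. The sketch below makes several assumptions: the smoke_test function and its parameters are hypothetical, terraform output -raw requires Terraform 0.14 or newer, the instance's security group must allow inbound HTTP from wherever the script runs, and the retries are there because the startup script needs some time to install Nginx.

```shell
#!/bin/bash
# Hypothetical smoke test for a freshly deployed environment.
# Run from the application stack directory.
#   $1 - workspace name, e.g. v2
#   $2 - number of attempts (default 10)
#   $3 - seconds to wait between attempts (default 15)
smoke_test() {
  local version="$1" attempts="${2:-10}" delay="${3:-15}"
  local ip i=0
  terraform workspace select "$version" >/dev/null || return 1
  # -raw prints the bare value; it assumes Terraform 0.14 or newer
  ip="$(terraform output -raw public_ip)" || return 1
  while [ "$i" -lt "$attempts" ]; do
    # pass when the page is served and mentions the workspace name
    if curl --silent --fail --max-time 10 "http://${ip}/" | grep -q "$version"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A pipeline step could then call smoke_test v2 and stop before the switch when it returns a non-zero status.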

Switching blue to green

Once we have performed at least some basic tests, we can move forward. You can improve this solution with a gradual switch using more advanced load balancing, or pick some other strategy, like geo-distribution - for the sake of simplicity, however, I will just do a big-bang switch. This is also the moment when you can perform all of the data migration that I do not cover in this article.

This one is actually the simplest part - at this point we do not even touch the deployment stacks, we only update the wrapper to pick a different deployment by changing its parameter:

[Figure: Terraform blue-green - switching active deployment]

cd wrapper-stack
terraform apply -var current=v2

The important thing here is that we are only switching where the wrapper (endpoint) stack points. We do not touch the deployments, so in case of failure we can still just update the wrapper stack to point back to the old version.
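The switch and the fallback are the same command with a different value, which makes them easy to wrap in a pipeline step. The following is a sketch under assumptions, not a definitive implementation: switch_current and switch_with_rollback are hypothetical names, the check is any command of yours that takes the version as its argument, and -auto-approve is used because a pipeline cannot answer the confirmation prompt.

```shell
#!/bin/bash
# Hypothetical helpers, run from the directory that contains wrapper-stack.

# Point the wrapper at the given workspace.
switch_current() {
  local version="$1"
  (cd wrapper-stack && terraform apply -auto-approve -var "current=${version}")
}

# Switch to $1; if the check command $3 fails, point back to $2.
switch_with_rollback() {
  local new="$1" old="$2" check="$3"
  switch_current "$new" || return 1
  if ! "$check" "$new"; then
    echo "check failed, pointing the wrapper back to ${old}" >&2
    switch_current "$old"
    return 1
  fi
}
```

Keep in mind that with the 300 second TTL of the Route 53 record, resolvers may keep serving the previous address for up to five minutes after either switch.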

Deleting old stack

The last thing left is the cleanup of the old environment, which we do not need anymore. Once we have performed all the checks and we are sure we can live with our new version, we simply delete all of its resources - as usual with Terraform:

[Figure: Terraform blue-green - deleting blue]

cd app-stack
terraform workspace select v1
terraform destroy
terraform workspace select default
terraform workspace delete v1

You can't run terraform workspace delete <old> while <old> is the selected workspace, which is why we need to switch to default first.

In case of any failure in the new stack, instead of deleting the old stack, just keep the old blue stack and delete the new (failed) one.
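Since the cleanup steps are identical for the old and the failed environment, they can be parameterized. The sketch below uses a hypothetical remove_environment helper; as an extra safety measure (my assumption, not something Terraform enforces) it refuses to remove the workspace that the wrapper currently points at, which you have to pass in yourself.

```shell
#!/bin/bash
# Hypothetical cleanup helper, run from the directory that contains app-stack.
#   $1 - workspace to remove
#   $2 - workspace the wrapper currently points at (must be kept)
remove_environment() {
  local version="$1" current="$2"
  if [ "$version" = "$current" ]; then
    echo "refusing to remove ${version}: the wrapper still points at it" >&2
    return 1
  fi
  (
    cd app-stack &&
      terraform workspace select "$version" &&
      terraform destroy -auto-approve &&
      terraform workspace select default &&
      terraform workspace delete "$version"
  )
}
```

After a successful switch this would be remove_environment v1 v2, and after a failed one remove_environment v2 v1.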

Post scriptum - I still like CloudFormation more when targeting AWS exclusively.
