Running your applications in the cloud requires a different approach from traditional datacenter hosting, especially when it comes to security. Instead of trusting everything inside a network perimeter, trust becomes specific to identities, because the cloud introduces dynamic variables that would typically be static in the datacenter hosting world. For example, IP addresses in the datacenter were typically static and very easy to whitelist; in the cloud, however, especially when your application is made up of microservices, IP addresses become dynamic and in some cases are next to impossible to predict.
HashiCorp Vault and HashiCorp Consul are a secrets management platform and a service discovery platform respectively. Used together, they address this problem by providing secrets management and encryption as a service.
Each platform ships as a single binary that can be installed on VMs in the cloud to create highly available clusters, always ready to serve your users and applications with the secrets they are authorised to access.
A typical cluster deployed in the cloud consists of a three-node Vault cluster, made up of a leader and two followers, alongside a five-node Consul cluster that acts as the storage backend for Vault. There are many architecture options when designing clusters, which are discussed in more detail in my blog series entitled Architecting Vault, but for this guide we will use the architecture in the diagram below, which is widely considered best practice and is in accordance with the HashiCorp Vault Reference Architecture guide.
This document is my self-proclaimed best practice guide for deploying and operating HashiCorp Vault in Microsoft Azure, along with some tips for using the Azure ecosystem in harmony with Vault. It is based on many years of experience working in the cloud, especially Microsoft Azure, multiple HashiCorp Vault deployments, and a series of research and development experiments. Each section contains either a best practice takeaway or an optional tip.
To build a Virtual Machine in Azure that is ready for Consul and Vault, there are three main components we need to provision: Network Interface Cards (NICs), virtual OS/data disks, and the VM itself.
When creating Virtual disks, we have the option of utilising Azure Key Vault to generate and store an encryption key that can be used to encrypt the Virtual Disks. This is called a Key Encryption Key (KEK).
Vault will encrypt data before storing it in Consul or any other chosen backend by default; however, the data stored in Consul by Vault is still of a sensitive nature, even in its encrypted state and where possible, additional safeguards should be put in place.
When deploying a Vault platform to Azure, the KEK is the ideal additional security layer, providing extra mitigation against the risk of unauthorised access to your Vault data, and as such it should be implemented.
Best Practice Takeaway Always use encrypted managed disks for the OS and data disks of your Consul and Vault servers.
Network Security Groups
Vault being a secrets management platform means there is a requirement to protect it from unknown and unauthorised networks, including the public internet. Security can never be absolute, so a key part of our role as architects and engineers is to mitigate the risks.
Azure Network Security Groups (NSGs) are firewall-like Access Control Lists that we can attach to virtual network interface cards to control which networks and/or IP addresses can access the Virtual Machine, as well as which ports it can be accessed on.
Using NSGs is recommended when deploying Virtual Machines for Consul and Vault, as they apply a default deny policy that requires explicit allow rules before the VM can be reached. This allows access to be restricted to the organisation's networks and their respective VPN networks.
Best Practice Takeaway Always use Azure Network Security Groups to restrict connectivity to your Azure hosted Vault clusters.
Azure Key Vault
When setting up HashiCorp Consul and Vault clusters for production, it is best practice to secure communications by configuring the servers to use TLS. One of the challenges of setting up TLS, and a wider challenge where secrets management is concerned, is the secure introduction of initial secrets; in this case, the secure injection of certificates into the Virtual Machines.
Using Azure Key Vault, the Trusted Platform Orchestrator model of secure secrets introduction can be applied, whereby we delegate trust to Azure Key Vault to securely inject TLS certificates into the VMs. VMs can be bootstrapped with certificates stored in Azure Key Vault.
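As a rough sketch of how this can be done with Terraform (all resource names, the image details and the SSH key path here are hypothetical placeholders, not values from a real deployment), a `secret` block on the VM resource tells Azure to inject a Key Vault certificate at provision time:

```hcl
# Sketch only: resource names and certificate references are placeholders.
resource "azurerm_linux_virtual_machine" "vault" {
  name                  = "vault-vm-1"
  resource_group_name   = azurerm_resource_group.vault.name
  location              = azurerm_resource_group.vault.location
  size                  = "Standard_D2s_v3"
  admin_username        = "vaultadmin"
  network_interface_ids = [azurerm_network_interface.vault.id]

  admin_ssh_key {
    username   = "vaultadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "UbuntuServer"
    sku       = "18.04-LTS"
    version   = "latest"
  }

  # Azure injects this TLS certificate from Key Vault when provisioning the VM
  secret {
    key_vault_id = azurerm_key_vault.tls.id

    certificate {
      url = azurerm_key_vault_certificate.vault_tls.secret_id
    }
  }
}
```

The orchestrator (Azure) is trusted to place the certificate on the machine, so no certificate material ever needs to live in your codebase or image.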
Best Practice Takeaway Adopt the Trusted Platform Orchestrator model to securely inject TLS certificates into the VMs.
Azure Shared Image Gallery
Building immutable infrastructure is about speed, idempotency and repeatability: a guarantee that infrastructure built from a codebase is always in the state described in that codebase, no matter how many times you run the code, and that it can be provisioned quickly and reliably.
There are many ways to build your Vault and Consul servers, from bash scripts using cloud-init to Ansible playbooks; the choice of methods and tools is yours. Whichever way you wish to install and configure your servers, we need a workflow that builds the image once, with all subsequent VMs built from that image.
Hashicorp Packer is an open source tool that can be used to build Virtual Machine images for all of the major cloud providers, including Azure. Whether your preferred method of installing and configuring your servers is via bash scripts or using Ansible, Packer has a provisioner that will enable you to build your Virtual Machine Images in a way that suits you.
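As an illustrative sketch (the script name, image details and region below are hypothetical placeholders), a Packer template using the `azure-arm` builder might look like this:

```hcl
# Sketch only: names and values are placeholders for your own environment.
source "azure-arm" "vault" {
  use_azure_cli_auth = true

  managed_image_name                = "vault-consul-image"
  managed_image_resource_group_name = "images-rg"

  os_type         = "Linux"
  image_publisher = "Canonical"
  image_offer     = "UbuntuServer"
  image_sku       = "18.04-LTS"

  location = "UK South"
  vm_size  = "Standard_D2s_v3"
}

build {
  sources = ["source.azure-arm.vault"]

  # Install and configure Consul and Vault once, at image-build time
  provisioner "shell" {
    script = "scripts/install_vault_consul.sh"
  }
}
```

The same template could equally use Packer's Ansible provisioner; the important part is that the installation happens once, in the image build, not on every VM.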
Once you have a Virtual Machine Image produced, it’s important to make it available for your organisation to use as appropriate, across different subscriptions if necessary. The recommended way of achieving this is to use Azure’s Shared Image Gallery to store and share your Virtual Machine Images.
This provides some advantages:
Images can be versioned
Sharing of the images can be controlled as per your business and security requirements using RBAC controls built into Azure
Images can be replicated to multiple regions for a better deployment experience
End User License Agreement can be packaged with the image offering
You can manage the full life cycle of your image.
Best Practice Takeaway Create reusable machine images, then version and store them in the Azure Shared Image Gallery. Always deploy your Consul and Vault clusters from these images.
Consul Cloud Auto-Join
Cloud auto-join is a feature of Consul that allows Consul nodes to join a cluster based on tags assigned to the virtual NICs in the cloud. Before HashiCorp implemented this feature, we would normally have to specify a list of IP addresses or DNS names of the nodes we want to join the cluster, like the example below:
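A minimal sketch of that static approach in a Consul server configuration (the addresses are placeholders):

```hcl
# consul.hcl -- static join list; must be kept up to date on every node
server     = true
retry_join = ["10.0.1.4", "10.0.1.5", "10.0.1.6"]
```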
The limitation of this method is that anytime you wish to expand the cluster or wish to use dynamic IP addresses, you would need to update this config file on every consul node in the cluster which is an additional operational overhead that we could do without. Taking advantage of Cloud auto-join means we don’t have to specify IP addresses, instead, we can specify a Tag that we will assign to the network interface cards of any VMs we wish to join the cluster and Consul will automatically pick these up.
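With cloud auto-join, the same `retry_join` setting points at a tag instead of a list of addresses. A sketch for Azure (the tag name, tag value and credential placeholders below are hypothetical; the credential fields can be omitted entirely when a managed identity is used):

```hcl
# consul.hcl -- Azure cloud auto-join; Consul discovers peers by NIC tag
server = true
retry_join = [
  "provider=azure tag_name=consul-cluster tag_value=server subscription_id=YOUR_SUBSCRIPTION_ID tenant_id=YOUR_TENANT_ID client_id=YOUR_CLIENT_ID secret_access_key=YOUR_CLIENT_SECRET"
]
```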
Once this configuration is in place, we simply add the specified tags to the network interface cards of the VMs we wish to join the cluster, and Consul takes care of the rest.
Tip Optional – Use cloud auto-join if you wish to deploy Consul clusters without having to specify IP addresses.
Vault Auto-Unseal
When a Vault node first starts up, it is in a sealed state, which protects the Vault by preventing all but three actions: checking the Vault status, initialising the Vault, and unsealing the Vault. By default, the master key in Vault is divided into five unseal keys, of which a threshold of any three needs to be entered to unseal the Vault. The idea is that the keys are distributed to trusted operators, each holding only one key, so that no single person can unseal the Vault alone.
Whilst this is a secure practice, it isn’t the most operationally friendly way of doing things as a server restart in the middle of the night would require you to wake up three engineers to enter their unseal keys to unseal the vault.
Vault allows us to leverage Azure Key Vault to store the master key and use it to automatically unseal the Vault, which solves the operational overhead just described.
To implement this, we need to create an Azure Key Vault, create a key inside it, and then add a seal stanza to the Vault config file like the example below:
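The stanza takes roughly the following shape (the tenant, client, vault and key names here are placeholders; the `client_id` and `client_secret` fields can be dropped when a managed identity is in use, as covered in the next section):

```hcl
# vault.hcl -- auto-unseal using a key held in Azure Key Vault
seal "azurekeyvault" {
  tenant_id     = "YOUR_TENANT_ID"
  client_id     = "YOUR_CLIENT_ID"
  client_secret = "YOUR_CLIENT_SECRET"
  vault_name    = "hc-vault-unseal"
  key_name      = "vault-unseal-key"
}
```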
Tip Optional – Use Vault auto-unseal with Azure Key Vault to reduce the operational overhead of the Vault servers, especially for unplanned server restarts.
Managed Service Identities
In order to implement Consul cloud auto-join and Vault auto-unseal, you can see from the configuration examples above that we had to pass in some sensitive values in the form of the client_secret and client_id. Hardcoding these in the config files introduces a new security risk: anyone with access to the file now has credentials to log in to Azure programmatically and perform any action the Service Principal has permission to undertake.
This also creates a barrier to putting the config under version control, even if your VCS is private or self-hosted. In the cloud operating model, we need to practise a zero-trust approach, which means only the principals, human or machine, that require a credential to perform a task should know what that credential is.
One approach to solving this problem has been the use of environment variables; however, when this approach is taken, anyone with access to the VM will be able to get the same service principal credentials by echoing the environment variable.
Instead, I recommend making use of Managed Service Identities, a smart feature of Azure that allows you to create an identity that you can assign to VMs so they can authenticate with Azure.
Using this feature of Azure, we can omit the client ID and client secret from the config examples above and still successfully authenticate with Azure. This solves the problems described above and removes the security risk. The below code snippet is an example of how to achieve this using Terraform to create an identity.
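A sketch of the identity resource (the resource and group names are placeholders):

```hcl
# A user-assigned managed identity that VMs can use to authenticate with Azure
resource "azurerm_user_assigned_identity" "vault" {
  name                = "vault-msi"
  resource_group_name = azurerm_resource_group.vault.name
  location            = azurerm_resource_group.vault.location
}
```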
This code example shows the above identity being associated with a Virtual Machine:
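Only the relevant `identity` block is shown here; the remaining VM settings are elided and the resource names are placeholders:

```hcl
resource "azurerm_linux_virtual_machine" "vault" {
  # ... other VM settings omitted for brevity ...

  # Attach the user-assigned managed identity to the VM
  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.vault.id]
  }
}
```

With the identity attached, the `client_id`/`client_secret` fields in the Consul and Vault configs can simply be removed.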
Best Practice Takeaway Always use managed service identities on your Azure VMs; it is the most secure way to authenticate against Azure without placing service principal credentials in your Vault and Consul configuration files.
Azure Service Endpoints
The auto-unseal feature described above makes use of Azure Key Vault to store the master key. Permissions to this key can be configured in quite a granular manner to prevent unauthorised access, which is great, but we shouldn't stop there. The master key stored in the Key Vault is essentially the key to your kingdom, so we need to protect it as much as possible. We should implement an extra layer of security at the Key Vault level, which is where Azure Service Endpoints come in.
Service endpoints are a feature of Azure that allows you to extend the identity of your virtual network to Azure resources, such as Azure Key Vault, so we can add rules that limit access to that resource to connections originating from the virtual network. This removes the default internet access. In fact, when designed and configured correctly, Vault will be the only entity able to access the Key Vault.
The below code example shows at a high level how you can enable and configure service endpoints for this use case.
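A high-level Terraform sketch (all names and address ranges are placeholders): the subnet exposes the `Microsoft.KeyVault` service endpoint, and the Key Vault's network ACLs deny everything except traffic from that subnet.

```hcl
# Subnet hosting the Vault servers, with the Key Vault service endpoint enabled
resource "azurerm_subnet" "vault" {
  name                 = "vault-subnet"
  resource_group_name  = azurerm_resource_group.vault.name
  virtual_network_name = azurerm_virtual_network.vault.name
  address_prefixes     = ["10.0.1.0/24"]
  service_endpoints    = ["Microsoft.KeyVault"]
}

# Key Vault holding the unseal key, reachable only from the Vault subnet
resource "azurerm_key_vault" "unseal" {
  name                = "hc-vault-unseal"
  resource_group_name = azurerm_resource_group.vault.name
  location            = azurerm_resource_group.vault.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  network_acls {
    default_action             = "Deny"
    bypass                     = "AzureServices"
    virtual_network_subnet_ids = [azurerm_subnet.vault.id]
  }
}
```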
Best Practice Takeaway Secure access to your Azure Key Vaults using service endpoints to ensure access to your encryption key is restricted to your Vault Servers.
Azure Availability Zones
Microsoft have recently introduced the concept of Availability Zones in Azure, which complements HashiCorp's Vault Reference Architecture well. It's worth noting that this feature isn't generally available in all regions yet, but the regions that do support AZs give you a total of three zones. Making use of this feature to spread your cluster across different datacenters provides extra redundancy in disaster scenarios where an entire datacenter is lost, and is highly encouraged.
The below diagram is taken from HashiCorp's Vault Reference Architecture guide.
Best Practice Takeaway Make use of Azure Availability Zones to spread your Vault and Consul clusters across three different datacenters for high availability. With this implementation, the loss of a datacenter will not mean a loss of service for your Vault.
Azure Application Gateway
Azure has a few different ways you can achieve load balancing; however, from my experience building production-grade clusters, I would only recommend the Application Gateway offering as far as Vault is concerned.
Production Vault clusters should use TLS for end-to-end encryption, but that encryption becomes less effective when it is terminated at the load balancer. The load balancer you choose should therefore support carrying TLS through to the backend, which Azure Application Gateway does.
It's a powerful load balancer that provides full routing capabilities as well as health checks, which is useful as far as Vault is concerned. One of the challenges of load balancing Vault clusters is making sure traffic is routed to the cluster leader. Vault provides a system health endpoint, for example https://vault-url.com:8200/v1/sys/health, which the Application Gateway can probe. A healthy cluster leader will always return a 200 response, whereas healthy standby nodes will return a 429 response. With these responses, we can make use of the Application Gateway's sophisticated routing capabilities to ensure that traffic is always routed to the cluster leader.
The same principles do not apply to the Consul front end if you have the UI enabled, as that works a little differently: you would only have the UI enabled on 2 of the 5 nodes in the cluster (as per HashiCorp's best practice recommendations for Consul), and there is no guarantee that either of those 2 nodes would be the cluster leader. In this scenario, the node pool would consist of only the two UI nodes, with the routing rules sending traffic to the first available healthy node.
The code example below shows how you can configure an application gateway for Vault using the sys/health endpoint.
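A sketch of the relevant probe and backend settings (names, ports and the wider gateway, listener and rule configuration are placeholders or elided). Because the probe only treats a 200 from `/v1/sys/health` as healthy, standby nodes returning 429 drop out of rotation and traffic always reaches the leader:

```hcl
resource "azurerm_application_gateway" "vault" {
  # ... gateway, listener and routing rule settings omitted for brevity ...

  # Probe Vault's health endpoint; only the cluster leader returns HTTP 200
  probe {
    name                                      = "vault-health"
    protocol                                  = "Https"
    path                                      = "/v1/sys/health"
    interval                                  = 10
    timeout                                   = 5
    unhealthy_threshold                       = 3
    pick_host_name_from_backend_http_settings = true

    match {
      status_code = ["200"]
    }
  }

  # Re-encrypt traffic to the backend to keep encryption end to end
  backend_http_settings {
    name                  = "vault-https"
    protocol              = "Https"
    port                  = 8200
    cookie_based_affinity = "Disabled"
    probe_name            = "vault-health"
  }
}
```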
Best Practice Takeaway When using a load balancer in front of Vault, do not terminate TLS there; the load balancer must carry the TLS encryption through to the backend to provide true end-to-end encryption. Use Azure Application Gateway in your set-up to ensure this.
Azure VM Backups
Disaster recovery is a key aspect of Vault operations, and as such, backup and restore operations are extremely important when running your cluster. Vault is designed so that the encrypted secrets are stored in a storage backend, Consul in this case; Vault encrypts and decrypts that data as required.
This decoupled approach to building the platform means we can easily add or remove nodes from a Vault cluster, as well as other operations without losing any secrets data. In terms of the Vault element of the platform, this is stateless for the most part as it doesn’t store the data; however, the configuration of Vault could be considered stateful.
To truly leverage the power of the cloud, we need to think about moving to an immutable infrastructure workflow. With this in mind, disaster recovery for Vault is less about backup and restore: we can use tools like HashiCorp Packer to build the VM images we use for Vault, and HashiCorp Terraform to manage the configuration of the Vault application, with Azure Blob Storage being a good option for storing Terraform state since the data is encrypted by default.
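For instance (the names below are placeholders), a Terraform backend block storing state in Azure Blob Storage might look like:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatestorage"
    container_name       = "tfstate"
    key                  = "vault-cluster.terraform.tfstate"
  }
}
```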
Using Consul as the storage backend for the cluster requires a different approach from Vault, as Consul stores the encrypted secrets data consumed by Vault. It is very much a stateful service, and as a result we need to implement a robust backup and restore process for Consul disaster recovery. There is a plethora of tools offering backup services for VMs, each with its pros and cons.
Azure offers a fully managed flexible VM backup service which works in harmony with Azure VMs and makes for an ideal tool to implement a DR strategy for Consul as a storage backend for Vault. The options for how to configure the backup service include backup frequency and retention period. In addition, it also supports backup of encrypted disks, which is the recommended approach for building VMs as discussed earlier in this guide.
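Assuming the Terraform azurerm provider (all names are placeholders), a sketch of a daily backup policy applied to a Consul VM might look like:

```hcl
# Recovery Services vault that holds the backups
resource "azurerm_recovery_services_vault" "backup" {
  name                = "consul-backup-vault"
  resource_group_name = azurerm_resource_group.vault.name
  location            = azurerm_resource_group.vault.location
  sku                 = "Standard"
}

# Daily backups, retained for 30 days
resource "azurerm_backup_policy_vm" "daily" {
  name                = "consul-daily"
  resource_group_name = azurerm_resource_group.vault.name
  recovery_vault_name = azurerm_recovery_services_vault.backup.name

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 30
  }
}

# Enrol a Consul VM in the backup policy
resource "azurerm_backup_protected_vm" "consul" {
  resource_group_name = azurerm_resource_group.vault.name
  recovery_vault_name = azurerm_recovery_services_vault.backup.name
  source_vm_id        = azurerm_linux_virtual_machine.consul.id
  backup_policy_id    = azurerm_backup_policy_vm.daily.id
}
```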
Best Practice Takeaway Ensure your data is securely backed up using Azure VM backup service. Practice restoring from backups to ensure your backup strategy is reliable.
Azure Monitoring Solutions
Metrics and logging are two key elements for operators of both Consul and Vault. To define the two in the context of Vault in Azure: metrics are the output of monitoring the Virtual Machines, the operating system and Vault/Consul application performance, to which alert triggers can be attached to inform operators when thresholds are breached. Logging is concerned with application logs from Consul and Vault; for example, Vault's audit logs show all requests sent to Vault and the responses, which are essentially security logs. Together these enable observability, which opens up a world of possibilities.

There are more commonly used ways to handle monitoring and observability using tools such as Prometheus, Telegraf, Grafana, ELK and Splunk, which would be the most sensible option to avoid cloud vendor lock-in and for ease of setup. However, for those fully invested in the Microsoft Azure ecosystem, with business requirements dictating that the above-mentioned tools cannot be used, there is the option of Azure Monitoring solutions. This provides out-of-the-box metrics for your Virtual Machines with full alerting capabilities, easily configured per resource. Azure Log Analytics is also available to store, aggregate, analyse and query log telemetry from Consul and Vault, though to enable this functionality you would need to build a solution that pushes the logs to the Azure Logs API; this is easily done.
Azure DevOps
Deploying and configuring Vault clusters requires an efficient mechanism to manage the desired state and control how changes are implemented in production. There are a variety of tools that can be used in combination to achieve this goal, and effective workflows must be built to ensure the stability and integrity of the environment. There are multiple ways of doing this, but my recommended approach is to divide it into four separate workflows:

1. Building the VM images
2. Deploying the infrastructure
3. Configuring the Vault application
4. Defining the IAM structure
This approach provides a clear separation of concerns and allows each element to be handled independently, though there are upstream dependencies: the IAM structure (4) needs the Vault application configuration (3) to be in place, which requires the infrastructure to be deployed (2), which in turn depends on the VM images being built (1).
There are many popular CI/CD tools available that you can use to build pipelines to implement these workflows, from Jenkins to Circle CI. Azure DevOps is a collection of tools that can be used to implement DevOps workflows and includes Azure Boards, Azure Repos and Azure Pipelines. The good thing about Azure DevOps is these components can be used individually or collectively.
Image definitions, infrastructure declarations, Vault configurations and IAM definitions should all be expressed as code, no matter the tool, be it Terraform, Ansible, Packer or Pulumi. They should be stored in source control, each in its own repo, with versioned releases deployed to production. GitHub is most commonly used for this; however, Azure Repos could also be used to achieve this goal. In terms of deployment and testing, Azure Pipelines is a viable alternative to Jenkins, Circle CI, GitLab or any other CI tool for deploying these changes to the production system.
Best Practice Takeaway Create separate workflows for each element of the Vault deployment. Create reusable VM images and version them. Declare all infrastructure and configuration as code. Store that code in a version control system. Create versioned artefacts from the code.