Diagnose server issues
When operating Vault, you can encounter issues during server startup that stem from a range of root causes, from server configuration problems to operating environment constraints.
Challenge
To effectively troubleshoot and resolve problems with Vault, you must examine and combine information from three distinct sources to arrive at root causes:
- Operating system environment conditions, such as user limits.
- Vault server configuration file.
- Vault server log output as described in the Vault Server Logs section of the Troubleshooting Vault tutorial.
Vault server configuration is essential to troubleshooting startup issues, while the log can reveal helpful warnings or errors from Vault that can have root causes related to the operating environment.
Gathering information from the system environment and server logs to narrow in on a root cause can be an arduous process, and even more so when in an outage situation.
It's a task that is ideally suited for automation to ensure that the results are consistent, repeatable, and arrive as needed.
A tool that helps Vault operators gather and interpret this information reduces the troubleshooting burden, lowers time to root cause discovery, and considerably reduces downtime during an outage.
Solution
Vault version 1.8.0 introduces a diagnose sub-command for the operator CLI command that helps operators identify the root causes of the most commonly encountered server configuration and startup issues.
You can use the command with the actual configuration for the server you need to diagnose. The typical workflow is to invoke diagnose against server configuration and data while the server is down. There is also an option that allows for performing diagnosis against a running server that you will learn about later.
More information about diagnose is available from the operator diagnose documentation, or by invoking vault operator diagnose -help from a terminal session.
Here is an actual output example to familiarize you with the types of checks performed and reported on by diagnose.
In this example output, diagnose ran against a Vault Community Edition server.
It reported a failure about storage along with some warnings about disk usage, licensing, and TLS.
The command aims to explain results in clear language, so the results are often self-explanatory. It also provides guidance to help with resolving warnings and failures, such as the recommendation to have at least 1GB of space free per partition.
What does diagnose check?
At a high level, the diagnose command checks and reports on these common root causes of server startup issues.
- Environment
- User limits: maximum open files
- Storage capacities
- Configuration
- Access configured storage
- Access HA storage
- Create seal
- Setup core
- Redirect address
- Cluster address
- Listeners
- TLS configuration
- Seal
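For orientation, the two environment items in this list can also be inspected by hand. This is a minimal sketch using standard shell tools, not what diagnose runs internally; the /tmp path is only a placeholder for whichever partition holds your Vault data.

```shell
# Soft limit on open file descriptors for the current shell session.
# diagnose reports on the "maximum open files" user limit.
open_files_limit="$(ulimit -n)"
echo "max open files: ${open_files_limit}"

# Free space (in KiB) on the partition that holds a given path.
# /tmp is a placeholder; substitute the partition of your data path.
df -k /tmp
```

These manual checks cover only a small part of the list; diagnose gathers them together with the configuration checks in one report.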
You will learn more about the types of failures, warnings, and recommendations from diagnose in the scenario.
Prerequisites
To perform the steps in the scenario, you need:
- Vault 1.8 or later; you can use the Community Edition for this tutorial.
- The Install Vault tutorial can guide you through installation.
- jq to handle JSON output from Vault CLI.
Scenario introduction
You will attempt to operate a local Vault server from the command line within a terminal session using the provided example configuration file.
First, you will use diagnose to check the example configuration.
Then, using the information from diagnose, you will resolve a reported failure in the environment.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.
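A minimal sketch of this step follows. The directory path /tmp/learn-vault-diagnose is taken from the error message quoted later in this tutorial; any writable location would work equally well.

```shell
# Create the scenario directory (no error if it already exists)
# and export its path for the later steps.
mkdir -p /tmp/learn-vault-diagnose
export LEARN_VAULT=/tmp/learn-vault-diagnose
```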
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl.
Write it to the scenario home directory.
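The exact file contents are not reproduced here, so the following is only a sketch of a configuration consistent with the rest of the scenario: file storage under the data directory, and a TCP listener on 127.0.0.1:8200 with TLS disabled. The api_addr value and the step that creates the data directory with mode 0000 are assumptions added so that the storage failure discussed later can be reproduced.

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
mkdir -p "$LEARN_VAULT"

# Assumed example configuration: file storage plus a non-TLS TCP listener.
cat > "$LEARN_VAULT/vault-server.hcl" <<EOF
api_addr = "http://127.0.0.1:8200"

storage "file" {
  path = "$LEARN_VAULT/data"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}
EOF

# Assumption: give the data directory no permissions at all, so that
# Vault cannot write to it and diagnose reports a storage failure.
mkdir -p "$LEARN_VAULT/data"
chmod 0000 "$LEARN_VAULT/data"
```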
Execute diagnose
Execute diagnose to check the initial example configuration.
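The invocation can be sketched as follows, assuming the configuration file sits at $LEARN_VAULT/vault-server.hcl. The exit status is captured and printed rather than interpreted, since its meaning is not covered in this tutorial.

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Run diagnose against the example configuration while the server is down.
diagnose_status=0
vault operator diagnose -config="$LEARN_VAULT/vault-server.hcl" || diagnose_status=$?
echo "diagnose exit status: ${diagnose_status}"
```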
Your output should resemble this example.
Note that diagnose reported an overall failure on line 4, a failure message about storage at lines 16 and 18, and a warning about the listener TLS configuration at lines 31-32.
The storage-related failure on line 18, "Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied", points to an issue with the Vault data directory, so try confirming the modes on that directory.
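One way to confirm the modes is a long listing of the directory itself, sketched here with the data path from the error message:

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
data_dir="$LEARN_VAULT/data"

# -d lists the directory entry itself instead of its contents;
# the first column shows the permission bits.
ls -ld "$data_dir" || echo "data directory not found: $data_dir"
```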
The permissions are too restrictive on the data directory.
To understand the Vault log messages around this issue at this point, try to start a Vault server with the configuration.
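A sketch of the start attempt, assuming the same configuration path as before. With the data directory still inaccessible, the server is expected to report an error; the exit status is captured only for inspection.

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Attempt to start the server with the example configuration.
# On success this command stays in the foreground until interrupted.
server_status=0
vault server -config="$LEARN_VAULT/vault-server.hcl" || server_status=$?
echo "vault server exit status: ${server_status}"
```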
The Vault server emits a similar permission denied error about the data path when attempting to access the core storage migration key.
Press control+c to stop the server.
Change the mode to 0700 so that Vault can write to the file storage configured for this path.
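The mode change can be sketched as follows, assuming the data directory lives at $LEARN_VAULT/data as in the error message. The mkdir call is only there to keep the snippet self-contained.

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# No-op when the directory already exists.
mkdir -p "$LEARN_VAULT/data"

# Owner gets read, write, and execute; group and others get nothing.
chmod 0700 "$LEARN_VAULT/data"

# Show the resulting permission bits.
ls -ld "$LEARN_VAULT/data"
```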
Execute the diagnose command again to re-check the configuration.
Your output should resemble this example.
You've resolved the failure about storage, but there is at least one warning in the diagnose output remaining.
Note
Depending on your environment, you might notice other warnings not present in the example output, such as warnings about storage volume capacity, open files, or more. You can try to resolve those for a passing result, but it's unnecessary for the purposes of this tutorial.
The warning explains that the listener does not have TLS enabled.
This warning is important, but it doesn't stop you from operating Vault (for example in a dev or QA capacity).
Note
Best practices detailed in Production Hardening recommend operating Vault with end-to-end TLS enabled for production use.
Given that there are no failures, the Vault server should start now even with any warnings present in the diagnose output.
Try again to start the server.
The Vault server started, confirming your resolution of the storage path permission issue.
Doing it live
You can also check a running Vault server by passing a -skip flag on the diagnose command line and specifying the Vault subsystem that diagnose should skip checking. This helps to avoid errors such as "Error initializing listener of type tcp: listen tcp 127.0.0.1:8200: bind: address already in use" when using diagnose against a running server.
In a new terminal session, try using diagnose while the Vault server is up and running. This time, use the -skip flag and specify listener so that diagnose skips the listener configuration.
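A sketch of that invocation, assuming the same configuration path used earlier in the scenario:

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Skip the listener checks so that diagnose does not try to bind
# the address the running server already holds.
skip_status=0
vault operator diagnose -config="$LEARN_VAULT/vault-server.hcl" -skip=listener || skip_status=$?
echo "diagnose exit status: ${skip_status}"
```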
Note
Resolving all diagnose warnings in this example configuration requires a valid TLS certificate and key, and either setting tls_disable = "false" or removing the line entirely. Doing so is beyond the scope of this tutorial, which offers a simple introduction to diagnose.
Cleanup
In the terminal where you started the Vault server, press control+c to stop the server.
Remove the temporary directory containing Vault server configuration and data.
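The removal can be sketched as follows. The fallback path is an assumption that mirrors the directory used throughout the scenario; double-check the variable before running a recursive delete.

```shell
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Print the target first, then remove the scenario directory and its contents.
echo "removing: $LEARN_VAULT"
rm -rf "$LEARN_VAULT"
```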
Summary
You learned about the diagnose sub-command and how to use it with Vault configuration to check for common issues. You also learned how to use the -skip flag to diagnose a running Vault server.