From 5b5ecfcc912bf439afcdaa755ecacded7c02962a Mon Sep 17 00:00:00 2001
From: David Laine <77020169+dlaine-hpe@users.noreply.github.com>
Date: Fri, 11 Oct 2024 15:18:22 -0500
Subject: [PATCH] CASMCMS-8393 - add console log rotation documentation. (#5438)

* CASMCMS-8393 - Document console log rotation settings.
* Revert other changes.
* CASMCMS-8393 - document console services log rotation.
* Fix markdown checker errors.
* Fix indentations.
* Fix markdown errors.
* Fix for PR comments.
---
 operations/conman/ConMan.md                   |   1 +
 operations/conman/Configure_Log_Rotation.md   | 127 ++++++++++++++++++
 .../Console_Services_Troubleshooting_Guide.md | 105 +++++++++++++++
 3 files changed, 233 insertions(+)
 create mode 100644 operations/conman/Configure_Log_Rotation.md

diff --git a/operations/conman/ConMan.md b/operations/conman/ConMan.md
index 7073c7b8b8aa..3b9ef924ebd7 100644
--- a/operations/conman/ConMan.md
+++ b/operations/conman/ConMan.md
@@ -16,6 +16,7 @@ There are multiple `cray-console-node` pods, scaled to the size of the system.
 - [Establish a Serial Connection to an NCN](Establish_a_Serial_Connection_to_NCNs.md)
 - [Disable ConMan After System Software Installation](Disable_ConMan_After_System_Software_Installation.md)
 - [Access Console Log Data Via the System Monitoring Framework (SMF)](Access_Console_Log_Data_Via_the_System_Monitoring_Framework_SMF.md)
+- [Configure Log Rotation](Configure_Log_Rotation.md)
 
 ## Troubleshooting
 
diff --git a/operations/conman/Configure_Log_Rotation.md b/operations/conman/Configure_Log_Rotation.md
new file mode 100644
index 000000000000..6201ca993bc8
--- /dev/null
+++ b/operations/conman/Configure_Log_Rotation.md
@@ -0,0 +1,127 @@
+# Configure Log Rotation
+
+In order to prevent the console logs from filling the PVC volume they are stored on,
+they are periodically rotated. This can keep a number of older sections of the log
+file as well as the current log file on the volume. Systems of different sizes have
+different requirements based on the number of nodes, the amount of text being written
+to the individual log files, the size of the PVC they are being stored on, and the
+history that needs to be kept in the form of the log files.
+
+All of the console log information is kept in the System Monitoring Framework, so these
+log files are not required for a permanent record of the console activity. See
+[Access Console Log Data Via the System Monitoring Framework](./Access_Console_Log_Data_Via_the_System_Monitoring_Framework_SMF.md)
+for more information on this topic.
+
+> **`NOTE`** Log rotation will move the current log file and create a new one with the original
+  location and name. If you are using a `tail` operation to watch the console log output,
+  make sure to use the `tail -F` option to automatically switch the `tail` to the new
+  file through a log rotation. Otherwise the `tail` will follow the old file, which has
+  moved and is no longer being appended to with new console log information.
+
+## How log rotation works
+
+On a regular schedule, the log rotation will execute the following steps:
+
+1. Check the size of all the current console log files.
+
+    If the size of the file is larger than a specified size, it will be
+    moved to the `/var/log/conman.old` directory with the name
+    `console.XNAME.1` and a new file will be created for the current logs at
+    `/var/log/conman/console.XNAME`.
+
+1. Manage the current backup files.
+
+    If a file already exists in the `/var/log/conman.old` directory for
+    a particular console log that is being rotated, the existing files
+    will be renamed `/var/log/conman.old/console.XNAME.N+1`.
+
+    There is a configuration setting for how many rotations to keep; once
+    that limit is reached, the oldest version of the console log file will
+    be deleted.
+
+## Modify the settings for the log rotation
+
+1. Edit the `cray-console-node` stateful set:
+
+    ```bash
+    kubectl -n services edit statefulset cray-console-node
+    ```
+
+1. Look for the section that contains log rotation settings:
+
+    ```text
+    - env:
+      - name: LOG_ROTATE_ENABLE
+        value: "True"
+      - name: LOG_ROTATE_FILE_SIZE
+        value: 5M
+      - name: LOG_ROTATE_SEC_FREQ
+        value: "600"
+      - name: LOG_ROTATE_NUM_KEEP
+        value: "2"
+    ```
+
+    1. `LOG_ROTATE_ENABLE`
+
+        This enables or disables the log rotation feature overall. If you wish to
+        not have any log rotation happen at all, then set the value to 'False', but
+        you must keep a close eye on the capacity of the PVC.
+
+    1. `LOG_ROTATE_SEC_FREQ`
+
+        This sets how often the log rotation will happen, in seconds. The default is
+        every 600 seconds (10 minutes). If you want rotation to happen more often,
+        decrease this setting; if you want it to happen less often, increase it. This
+        is the interval between when log rotation completes and when it starts again,
+        so if the rotation takes a bit of time you may see the actual time between
+        two subsequent log rotations end up longer than this interval.
+
+    1. `LOG_ROTATE_FILE_SIZE`
+
+        This is the size threshold for rotating a file. When the log rotation happens, if an
+        individual log file is larger than this size, it will be rotated.
+
+        Depending on how often the log rotation is executed and how quickly the file
+        is growing, you may see the files get quite a bit larger than this size when
+        the rotation actually happens. If files are growing significantly larger than
+        this setting, increase the frequency of log rotations by decreasing `LOG_ROTATE_SEC_FREQ`.
+
+    1. `LOG_ROTATE_NUM_KEEP`
+
+        This is the number of log rotations it will keep in the `/var/log/conman.old`
+        directory. For example, if this value is 2, there will be a
+        `/var/log/conman.old/console.XNAME.1` and `/var/log/conman.old/console.XNAME.2`
+        file for each console that has logging active (after sufficient time has passed
+        for the file to be rotated twice). Setting this value to 0 will prevent any
+        older files from being kept.
+
+## Scenarios that may be encountered and possible solutions
+
+1. The log files are getting too large before they are being rotated.
+
+    Decrease the value of `LOG_ROTATE_FILE_SIZE` to make smaller files
+    subject to rotation.
+
+    If the files are much larger than `LOG_ROTATE_FILE_SIZE` when rotated, decrease the
+    value of `LOG_ROTATE_SEC_FREQ` so the rotation happens more often.
+
+1. Log files are being rotated before a complete boot.
+
+    If the boot operation outputs a lot of information, increase the value of
+    `LOG_ROTATE_FILE_SIZE` to let the file grow larger before a rotation
+    happens.
+
+1. The PVC is being filled up.
+
+    This means there is too much data being retained for the current size of the PVC.
+    The following may be done to decrease the amount of data:
+
+    1. Decrease the value of `LOG_ROTATE_FILE_SIZE` to keep the file size down.
+
+    1. Decrease the value of `LOG_ROTATE_SEC_FREQ` to rotate the log files more frequently.
+
+    1. Decrease the value of `LOG_ROTATE_NUM_KEEP` to keep fewer old copies of the log files.
+
+    If none of these steps are appropriate for the requirements of the system, the size of the
+    PVC may be increased by following the directions here:
+    [Console Services Troubleshooting Guide](./Console_Services_Troubleshooting_Guide.md#check-the-capacity-of-the-pvc)
diff --git a/operations/conman/Console_Services_Troubleshooting_Guide.md b/operations/conman/Console_Services_Troubleshooting_Guide.md
index 09ea7ee7458a..a5cb874044a2 100644
--- a/operations/conman/Console_Services_Troubleshooting_Guide.md
+++ b/operations/conman/Console_Services_Troubleshooting_Guide.md
@@ -9,6 +9,7 @@ how to look at all aspects of the service to determine what the current problem
 * [Find the `cray-console-node` pod for a specific node](#find-the-cray-console-node-pod-for-a-specific-node)
 * [Investigate service problem](#investigate-service-problem)
 * [Investigate Postgres deployment](#investigate-postgres-deployment)
+* [Check the capacity of the PVC](#check-the-capacity-of-the-pvc)
 
 ## Prerequisites
 
@@ -229,3 +230,107 @@ If the database can not be made healthy through these procedures, the easiest wa
 resolve this is to perform a complete reset of the console services including reinstalling the
 `cray-console-data` service. See
 [Complete Reset of the Console Services](Complete_Reset_of_the_Console_Services.md).
+
+## Check the capacity of the PVC
+
+There is a shared PVC that is mounted to all the `cray-console-node` pods that is used to
+write the individual console log files. If this volume fills up, the log files will no
+longer be written to and log data will be lost. If a log file is being followed, it will look like
+the logging has stopped, but accessing the console directly with `conman` will still show
+the current console output.
+
+This volume is mounted on the `/var/log` directory inside the `cray-console-node` pods.
+To check the usage of this PVC:
+
+1. (`ncn-mw#`) Log into one of the `cray-console-node` pods.
+
+    ```bash
+    kubectl -n services exec -it cray-console-node-0 -c cray-console-node -- sh
+    ```
+
+1. (`pod#`) Check the volume usage.
+
+    ```bash
+    df -h | grep -E 'Size|/var/log'
+    ```
+
+    Expected results will look something like:
+
+    ```text
+    Filesystem                                     Size  Used Avail Use% Mounted on
+    10.252.1.18:6789:/volumes/csi/csi-vol-0f39...  100G   36M  100G   1% /var/log
+    ```
+
+    If the `Used` value is approaching or equal to the `Size` value, the volume is
+    filling up.
+
+There are a couple of ways to resolve this situation.
+
+1. Remove excess files from the volume.
+
+    The console files are stored in `/var/log/conman` and named `console.XNAME` to
+    distinguish which log files are from which nodes. If there are some log files that
+    are left over from nodes no longer in use, they may be removed.
+
+    The backup files for the console logs are stored in `/var/log/conman.old`. When
+    the individual files get too large they are moved to this directory by the
+    `logrotate` application. If these files are not needed for looking through historical
+    console logs, they may be removed.
+
+    The files in the `/var/log/console` directory are small and required for the
+    operation of the console services, so do not remove them.
+
+1. Adjust the log rotation settings.
+
+    The `logrotate` application is used to manage the size of the log files as they
+    grow over time. The settings for this functionality are described in
+    [Configure Log Rotation](Configure_Log_Rotation.md). Tune the settings for this
+    system to prevent the log files from filling up the PVC.
+
+1. (`ncn-mw#`) Increase the size of the PVC.
+
+    If the system is large, the default settings for the log rotation and the PVC
+    size may not be sufficient to hold the console log files and the backups. If
+    more backups are required than can fit on the current PVC, it may be increased
+    in size without losing any of the current data on the volume.
+
+    1. Edit the PVC to increase the size.
+
+        ```bash
+        kubectl -n services edit pvc cray-console-operator-data-claim
+        ```
+
+        Modify the value of `spec.resources.requests.storage` to the increased value required:
+
+        ```text
+        spec:
+          accessModes:
+          - ReadWriteMany
+          resources:
+            requests:
+              storage: 150Gi
+        ```
+
+    1. Scale the number of `cray-console-operator` pods to zero.
+
+        ```bash
+        kubectl -n services scale deployment --replicas=0 cray-console-operator
+        ```
+
+    1. Scale the number of `cray-console-node` pods to zero.
+
+        ```bash
+        kubectl -n services scale statefulset --replicas=0 cray-console-node
+        ```
+
+    1. Wait for these pods to terminate.
+
+    1. Scale the number of `cray-console-operator` pods to one.
+
+        ```bash
+        kubectl -n services scale deployment --replicas=1 cray-console-operator
+        ```
+
+        When the `cray-console-operator` pod resumes operation it will scale the number of
+        `cray-console-node` pods back up automatically. After all pods are back up and
+        ready, the new increased size of the PVC will be visible from within the pods.
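
Reviewer note (not part of the patch): when choosing values for the rotation settings or a new PVC size, a rough estimate of the space the console logs occupy can be worked out from the settings this patch documents. Each console keeps one current file plus up to `LOG_ROTATE_NUM_KEEP` rotated copies, and a file is rotated once it exceeds `LOG_ROTATE_FILE_SIZE`. The sketch below is only an illustration under those assumptions; the function name `estimate_console_log_mb` and the example node count are hypothetical and are not part of the console services.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not shipped with the console services.
# Rough estimate of console log space, in the same unit as size_mb:
#   consoles * (rotated copies kept + 1 current file) * rotation size threshold
estimate_console_log_mb() {
    local consoles=$1   # number of node consoles being logged (assumed input)
    local size_mb=$2    # LOG_ROTATE_FILE_SIZE as a plain number, e.g. 5 for "5M"
    local num_keep=$3   # LOG_ROTATE_NUM_KEEP
    echo $(( consoles * (num_keep + 1) * size_mb ))
}

# Example: 1000 consoles with the settings shown in Configure_Log_Rotation.md.
estimate_console_log_mb 1000 5 2
```

Because files can grow past the threshold between rotation passes, the result is a starting point, not an upper bound. Leave headroom and compare it against the `df -h` output from inside a `cray-console-node` pod.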