CASMCMS-8393 - add console log rotation documentation. (#5438)
* CASMCMS-8393 - Document console log rotation settings.

* Revert other changes.

* CASMCMS-8393 - document console services log rotation.

* Fix markdown checker errors.

* Fix indentations.

* Fix markdown errors.

* Fix for PR comments.
dlaine-hpe authored Oct 11, 2024
1 parent 72fad33 commit 5b5ecfc
Showing 3 changed files with 233 additions and 0 deletions.
1 change: 1 addition & 0 deletions operations/conman/ConMan.md
@@ -16,6 +16,7 @@ There are multiple `cray-console-node` pods, scaled to the size of the system.
- [Establish a Serial Connection to an NCN](Establish_a_Serial_Connection_to_NCNs.md)
- [Disable ConMan After System Software Installation](Disable_ConMan_After_System_Software_Installation.md)
- [Access Console Log Data Via the System Monitoring Framework (SMF)](Access_Console_Log_Data_Via_the_System_Monitoring_Framework_SMF.md)
- [Configure Log Rotation](Configure_Log_Rotation.md)

## Troubleshooting

127 changes: 127 additions & 0 deletions operations/conman/Configure_Log_Rotation.md
@@ -0,0 +1,127 @@
# Configure Log Rotation

To prevent the console logs from filling the PVC they are stored on, the logs are
periodically rotated. Rotation keeps a configurable number of older sections of each
log file on the volume alongside the current log file. Systems of different sizes have
different requirements based on the number of nodes, the amount of text being written
to the individual log files, the size of the PVC the files are stored on, and how much
history needs to be kept in the log files.

All of the console log information is kept in the System Monitoring Framework so these
log files are not required for a permanent record of the console activity. See
[Access Console Log Data Via the System Monitoring Framework](./Access_Console_Log_Data_Via_the_System_Monitoring_Framework_SMF.md)
for more information on this topic.

> **`NOTE`** Log rotation moves the current log file and creates a new one at the original
location with the original name. When using `tail` to watch the console log output,
use the `tail -F` option so that `tail` automatically switches to the new
file after a log rotation. Otherwise, `tail` keeps following the old file, which has
been moved and no longer receives new console log information.
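
As a minimal sketch of the note above, the helper below wraps `tail -F` in a `kubectl exec` call. The function name `follow_console_log` is hypothetical, and the pod name `cray-console-node-0` and container name `cray-console-node` are assumptions that may differ on a given system.

```shell
# Hypothetical helper: follow one node's console log across rotations.
# Assumes the pod cray-console-node-0 and the container cray-console-node.
follow_console_log() {
    xname=$1
    # -F (rather than -f) reopens the file by name after it is rotated.
    kubectl -n services exec -it cray-console-node-0 -c cray-console-node -- \
        tail -F "/var/log/conman/console.${xname}"
}
```

For example, `follow_console_log XNAME` would follow `/var/log/conman/console.XNAME` until interrupted.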

## How log rotation works

On a regular schedule, the log rotation will execute the following steps:

1. Check the size of all the current console log files.

If a file is larger than the configured size, it is moved to the
`/var/log/conman.old` directory with the name `console.XNAME.1`, and a
new file is created for the current logs at
`/var/log/conman/console.XNAME`.

1. Manage the current backup files.

If files already exist in the `/var/log/conman.old` directory for
a console log that is being rotated, each existing file
`console.XNAME.N` is renamed to `/var/log/conman.old/console.XNAME.N+1`.

A configuration setting controls how many rotations to keep. Once
that limit is reached, the oldest version of the console log file is
deleted.
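
The two steps above can be illustrated with a small shell sketch. This is not the console service's actual implementation. It is a simplified model of one rotation pass for a single file, and the function name `rotate_one` and its arguments are hypothetical.

```shell
# Hypothetical sketch of one rotation pass for a single console log.
# Usage: rotate_one LOG_FILE OLD_DIR MAX_BYTES NUM_KEEP
rotate_one() {
    log=$1
    old_dir=$2
    max_bytes=$3
    keep=$4
    [ -f "$log" ] || return 0
    size=$(( $(wc -c < "$log") ))
    # Step 1: only files larger than the threshold are rotated.
    [ "$size" -gt "$max_bytes" ] || return 0
    base=$(basename "$log")
    mkdir -p "$old_dir"
    # Step 2: delete the oldest copy, then shift N to N+1, highest first.
    n=$keep
    while [ "$n" -ge 1 ]; do
        if [ -f "$old_dir/$base.$n" ]; then
            if [ "$n" -ge "$keep" ]; then
                rm -f "$old_dir/$base.$n"
            else
                mv "$old_dir/$base.$n" "$old_dir/$base.$((n + 1))"
            fi
        fi
        n=$((n - 1))
    done
    # Move the current log to .1 (or discard it when nothing is kept).
    if [ "$keep" -ge 1 ]; then
        mv "$log" "$old_dir/$base.1"
    else
        rm -f "$log"
    fi
    # Create a new, empty file at the original location.
    : > "$log"
}
```

Calling `rotate_one /var/log/conman/console.XNAME /var/log/conman.old 5242880 2` would model the default settings described in this document, with the 5M threshold expressed in bytes.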

## Modify the settings for the log rotation

1. Edit the `cray-console-node` stateful set:

```bash
kubectl -n services edit statefulset cray-console-node
```

1. Look for the section that contains log rotation settings:

```text
- env:
  - name: LOG_ROTATE_ENABLE
    value: "True"
  - name: LOG_ROTATE_FILE_SIZE
    value: 5M
  - name: LOG_ROTATE_SEC_FREQ
    value: "600"
  - name: LOG_ROTATE_NUM_KEEP
    value: "2"
```

1. `LOG_ROTATE_ENABLE`

This enables or disables the log rotation feature as a whole. To turn log
rotation off entirely, set the value to `"False"`, but then keep a close eye
on the capacity of the PVC.

1. `LOG_ROTATE_SEC_FREQ`

This sets how often the log rotation happens, in seconds. The default is
every 600 seconds (10 minutes). To make rotation happen more often, decrease
this setting. To make it happen less often, increase it. This is the interval
between when one log rotation completes and when the next one starts, so if a
rotation takes some time, the actual time between two subsequent log rotations
may be longer than this interval.

1. `LOG_ROTATE_FILE_SIZE`

This is the size threshold for rotating a file. When the log rotation runs, any
individual log file that is larger than this size is rotated.

Depending on how often the log rotation runs and how quickly a file is growing,
files may be quite a bit larger than this size by the time the rotation actually
happens. If files are growing significantly larger than this setting, increase
the frequency of log rotations by decreasing `LOG_ROTATE_SEC_FREQ`.

1. `LOG_ROTATE_NUM_KEEP`

This is the number of rotated log files kept in the `/var/log/conman.old`
directory. For example, if this value is 2, there will be a
`/var/log/conman.old/console.XNAME.1` and a `/var/log/conman.old/console.XNAME.2`
file for each console that has logging active (after sufficient time has passed
for the file to be rotated twice). Setting this value to 0 prevents any
older files from being kept.
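
As an alternative to editing the stateful set interactively, the same variables could be changed with `kubectl set env`. The following is a sketch only: the helper names `set_log_rotation` and `show_log_rotation` are hypothetical, and the example values are illustrative rather than recommended.

```shell
# Hypothetical helper: set one or more log rotation variables on the
# cray-console-node stateful set without opening an editor.
set_log_rotation() {
    kubectl -n services set env statefulset/cray-console-node "$@"
}

# Hypothetical helper: list the log rotation variables currently set.
show_log_rotation() {
    kubectl -n services set env statefulset/cray-console-node --list |
        grep '^LOG_ROTATE_'
}
```

For example, `set_log_rotation LOG_ROTATE_FILE_SIZE=10M LOG_ROTATE_NUM_KEEP=3` followed by `show_log_rotation` would apply and then display the new values.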

## Scenarios that may be encountered and possible solutions

1. The log files are getting too large before they are being rotated.

Decrease the value of `LOG_ROTATE_FILE_SIZE` to make smaller files
subject to rotation.

If the files are growing well beyond `LOG_ROTATE_FILE_SIZE` before they are
rotated, decrease the value of `LOG_ROTATE_SEC_FREQ` so the rotation happens
more often.

1. Log files are being rotated before a complete boot.

If the boot operation outputs a lot of information, increase the value of
`LOG_ROTATE_FILE_SIZE` so the file can grow larger before it is rotated.

1. The PVC is being filled up.

This means there is too much data being retained for the current size of the PVC.
The following may be done to decrease the amount of data:

1. Decrease the value of `LOG_ROTATE_FILE_SIZE` to keep the file size down.

1. Decrease the value of `LOG_ROTATE_SEC_FREQ` to rotate the log files more frequently.

1. Decrease the value of `LOG_ROTATE_NUM_KEEP` to keep fewer old copies of the log files.

If none of these steps are appropriate for the requirements of the system, the size of the
PVC may be increased by following the directions here:
[Console Services Troubleshooting Guide](./Console_Services_Troubleshooting_Guide.md#check-the-capacity-of-the-pvc)
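
When weighing these options, a rough estimate of the space the console logs can occupy is helpful. The sketch below multiplies the number of nodes by the number of files per node (the kept copies plus the current file) and by the rotation size. The function name `estimate_log_usage_mib` and the node count of 1000 are hypothetical. Because files can grow past `LOG_ROTATE_FILE_SIZE` between rotation passes, treat the result as an approximation rather than a hard upper bound.

```shell
# Rough estimate, in MiB, of the space used once every node has a full
# set of rotated files: nodes * (kept copies + current file) * size.
# Usage: estimate_log_usage_mib NODES FILE_SIZE_MIB NUM_KEEP
estimate_log_usage_mib() {
    nodes=$1
    size_mib=$2
    num_keep=$3
    echo $(( nodes * (num_keep + 1) * size_mib ))
}

# Default settings (5M, 2 copies) on a hypothetical 1000-node system.
estimate_log_usage_mib 1000 5 2   # prints 15000
```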
105 changes: 105 additions & 0 deletions operations/conman/Console_Services_Troubleshooting_Guide.md
@@ -9,6 +9,7 @@ how to look at all aspects of the service to determine what the current problem
* [Find the `cray-console-node` pod for a specific node](#find-the-cray-console-node-pod-for-a-specific-node)
* [Investigate service problem](#investigate-service-problem)
* [Investigate Postgres deployment](#investigate-postgres-deployment)
* [Check the capacity of the PVC](#check-the-capacity-of-the-pvc)

## Prerequisites

@@ -229,3 +230,107 @@ If the database can not be made healthy through these procedures, the easiest wa
resolve this is to perform a complete reset of the console services including
reinstalling the `cray-console-data` service. See
[Complete Reset of the Console Services](Complete_Reset_of_the_Console_Services.md).

## Check the capacity of the PVC

A shared PVC is mounted on all of the `cray-console-node` pods and holds the
individual console log files. If this volume fills up, the log files can no longer
be written to and log data will be lost. When following a log file, it will look like
logging has stopped, but connecting to the console directly with `conman` will still
show the current console output.

This volume is mounted on the `/var/log` directory inside the `cray-console-node` pods.
To check the usage of this PVC:

1. (`ncn-mw#`) Log into one of the `cray-console-node` pods.

```bash
kubectl -n services exec -it cray-console-node-0 -c cray-console-node -- sh
```

1. (`pod#`) Check the volume usage.

```bash
df -h | grep -E 'Size|/var/log'
```

Expected results will look something like:

```text
Filesystem                                     Size  Used Avail Use% Mounted on
10.252.1.18:6789:/volumes/csi/csi-vol-0f39...  100G   36M  100G   1% /var/log
```

If the `Used` value is approaching or equal to the `Size` value, the volume is
filling up.
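
For scripted checks, the usage percentage can be pulled out of the `df` output. This is a minimal sketch that assumes `awk` is available where it runs. The function name `pvc_use_percent` is hypothetical.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the usage
# percentage (without the percent sign) for the given mount point.
pvc_use_percent() {
    awk -v mount="$1" \
        'NR > 1 && $NF == mount { sub(/%$/, "", $(NF - 1)); print $(NF - 1) }'
}
```

Inside the pod, `df -P /var/log | pvc_use_percent /var/log` would print a single number that can be compared against a chosen threshold.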

There are a couple of ways to resolve this situation.

1. Remove excess files from the volume.

The console files are stored in `/var/log/conman` and named `console.XNAME` to
distinguish which log files are from which nodes. If there are some log files that
are left over from nodes no longer in use, they may be removed.

The backup files for the console logs are stored in `/var/log/conman.old`. When
the individual files get too large they are moved to this directory by the
`logrotate` application. If these files are not needed for looking through historical
console logs, they may be removed.

The files in the `/var/log/console` directory are small and required for the
operation of the console services so do not remove them.

1. Adjust the log rotation settings.

The `logrotate` application is used to manage the size of the log files as they
grow over time. The settings for this functionality are described in
[Configure Log Rotation](Configure_Log_Rotation.md). Tune the settings for this
system to prevent the log files from filling up the PVC.

1. (`ncn-mw#`) Increase the size of the PVC.

If the system is large, the default settings for the log rotation and the PVC
size may not be sufficient to hold the console log files and the backups. If
more backups are required than can fit on the current PVC, it may be increased
in size without losing any of the current data on the volume.

1. Edit the PVC to increase the size.

```bash
kubectl -n services edit pvc cray-console-operator-data-claim
```

Modify the value of `spec.resources.requests.storage` to the increased value required:

```text
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 150Gi
```

1. Scale the number of `cray-console-operator` pods to zero.

```bash
kubectl -n services scale deployment --replicas=0 cray-console-operator
```

1. Scale the number of `cray-console-node` pods to zero.

```bash
kubectl -n services scale statefulset --replicas=0 cray-console-node
```

1. Wait for these pods to terminate.

1. Scale the number of `cray-console-operator` pods to one.

```bash
kubectl -n services scale deployment --replicas=1 cray-console-operator
```

When the `cray-console-operator` pod resumes operation, it will scale the number of
`cray-console-node` pods back up automatically. After all pods are back up and
ready, the new increased size of the PVC will be visible from within the pods.
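
The "wait for these pods to terminate" step can be scripted by polling the pod list. The sketch below assumes the pod names begin with `cray-console-operator-` and `cray-console-node-`. The function name `wait_for_console_pods_gone` and the five-second polling interval are hypothetical choices.

```shell
# Hypothetical helper: block until no cray-console-operator or
# cray-console-node pods remain in the services namespace.
wait_for_console_pods_gone() {
    while kubectl -n services get pods --no-headers 2>/dev/null |
            grep -Eq '^cray-console-(operator|node)-'; do
        sleep 5
    done
}
```

Run `wait_for_console_pods_gone` after scaling both workloads to zero and before scaling `cray-console-operator` back to one.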
