Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InstantOn app image fails to restore on OCP if deploy with /logs persistence #550

Open
tam512 opened this issue Jun 5, 2024 · 12 comments
Open

Comments

@tam512
Copy link

tam512 commented Jun 5, 2024

when we deploy InstantOn app image to OCP using the following statefulSet settings

apiVersion: liberty.websphere.ibm.com/v1
kind: WebSphereLibertyApplication
metadata:
    name: quicksec-jakarta
    labels:
      app: quicksec_wlo
spec:
  license:
    accept: true
  applicationImage: my_instanton_app_image
  statefulSet: 
    storage:
      size: 2Gi
      mountPath: "/logs"
..............

the server will fail to start with the following errors:

mkdir: cannot create directory ‘/logs/checkpoint’: Permission denied
/opt/ibm/wlp/bin/server: line 208: /logs/console.log: Permission denied
tail: cannot open '/logs/console.log' for reading: No such file or directory
tail: no files remaining
Error (criu/log.c:221): Can't create log file /logs/checkpoint/restore.log: No such file or directory
/opt/ibm/wlp/bin/server: line 1377: kill: (215) - No such process
CWWKE0961I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
JVMJ9GC063E Unable to open file '/logs/verbosegc.log' for writing
@tam512
Copy link
Author

tam512 commented Jun 5, 2024

@tjwatson
Copy link
Member

@tam512 do you see any logs persisted with a container image that doesn't use InstantOn in this setup? I see no way logs can be persisted if the mount of the /logs/ folder isn't writable by the user in the running container.

@tjwatson
Copy link
Member

I can recreate outside of OCP if I mount a directory which is not writable by the user in the container.

When I attempt to start the same application image without the use of InstantOn then I get loads of output from verbose GC (because it cannot write to the verbosegc log and ends up logging to the standard out. I also don't see any messages or trace logs in the mounted logs directory.

For the OCP case to work the logs mount has to be writable by the user in the container. That does not appear to be the case here. I am unsure how you configure the deployment on OCP to have a writable mount here.

@tam512
Copy link
Author

tam512 commented Jun 18, 2024

I can see the logs persisted without instantOn. I just deployed the app image to OCP and I can see older logs with older time stamps

% oc exec quicksec-jakarta-0  -- ls -l /            
total 8
...........
drwxr-xr-x.   1 root    root       4096 Jun  4 12:53 licenses
drwxrwsr-x.   3 root    1000790000    7 Jun 18 19:41 logs
.............
drwxr-xr-x.   1 root    root         17 Jun  6 13:42 opt
lrwxrwxrwx.   1 root    root         33 Jun  4 12:54 output -> /opt/ibm/wlp/output/defaultServer
..........

% oc exec quicksec-jakarta-0  -- ls -l /logs
total 3088
drwxrws---. 2 1000790000 1000790000     366 Jun  4 15:35 ffdc
-rw-rw----. 1 1000790000 1000790000   65060 Jun  5 18:53 messages_24.06.18_19.41.57.0.log
-rw-r-----. 1 1000790000 1000790000   53270 Jun 18 19:42 messages.log
-rw-rw----. 1 1000790000 1000790000  161940 May 29 21:45 trace_24.05.30_21.50.28.0.log
-rw-rw----. 1 1000790000 1000790000  278480 May 30 21:56 trace.log
-rw-rw----. 1 1000790000 1000790000  294580 Jun 18 19:42 verbosegc.log
-rw-rw----. 1 1000790000 1000790000 2306837 May 24 21:14 verbosegc.log.001

% oc exec quicksec-jakarta-0  -- ls -l /logs/ffdc
total 23197
-rw-rw----. 1 1000790000 1000790000   1205 Jun  4 15:35 exception_summary_24.06.04_15.32.29.0.log
-rw-rw----. 1 1000790000 1000790000   1205 Jun  4 15:38 exception_summary_24.06.04_15.35.49.0.log
-rw-rw----. 1 1000790000 1000790000  30554 May 23 16:44 ffdc_24.05.23_16.44.08.0.log
...........
-rw-rw----. 1 1000790000 1000790000  36550 May 24 02:22 ffdc_24.05.24_02.22.08.0.log
-rw-rw----. 1 1000790000 1000790000  32494 May 24 02:22 ffdc_24.05.24_02.22.09.0.log
........
-rw-rw----. 1 1000790000 1000790000   6907 May 25 11:08 ffdc_24.05.25_11.08.22.0.log
-rw-rw----. 1 1000790000 1000790000 129314 May 28 20:41 ffdc_24.05.28_20.41.27.0.log
.......
-rw-rw----. 1 1000790000 1000790000 129334 May 30 23:43 ffdc_24.05.30_23.43.39.0.log
-rw-rw----. 1 1000790000 1000790000   6906 May 31 00:07 ffdc_24.05.31_00.07.58.0.log
.........
-rw-rw----. 1 1000790000 1000790000  87754 Jun  4 15:15 ffdc_24.06.04_15.15.51.0.log

@tjwatson
Copy link
Member

I just deployed the app image to OCP and I can see older logs with older time stamps

If you just deployed the app image right now to OCP I would expect new logs with today's timestamp.

@tam512
Copy link
Author

tam512 commented Jun 18, 2024

sorry I was not clear about the logs. I saw today's log (Jun 18) and older logs

-rw-rw----. 1 1000790000 1000790000 65060 Jun 5 18:53 messages_24.06.18_19.41.57.0.log
-rw-r-----. 1 1000790000 1000790000 53270 Jun 18 19:42 messages.log

@tjwatson
Copy link
Member

I suspect the issue may be the change of the security context used when deploying the instanton image. Something has to be different there that causes the mounted /logs folder to be read-only.

@tam512
Copy link
Author

tam512 commented Jun 19, 2024

Below is the SecurityContextConstraints I used. Do I need to update anything to make it writeable to /logs

# https://docs.openshift.com/container-platform/4.14/authentication/managing-security-context-constraints.html
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: criu-scc is based on the restricted SCC but removes any restrictions
      that prevent the restore image from running successfully.
  generation: 1
  name: cap-cr-scc
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities:
- CHECKPOINT_RESTORE
- SETPCAP
fsGroup:
  type: RunAsAny
groups:
- system:authenticated
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

@tjwatson
Copy link
Member

Maybe @leochr knows. I'm not sure if you can compare with what the effective one is for normal apps.

@leochr
Copy link
Member

leochr commented Jun 25, 2024

In OpenShift restricted-v2 is used as the SecurityContextConstraints (SCC) by default. Comparing the above SCC to it, I couldn't find any difference that could explain the storage permission issue. Suggest to try the above SCC cap-cr-scc with a non-instanton app.

@tam512
Copy link
Author

tam512 commented Jun 26, 2024

@leochr I also see the permission errors when using the above SCC with non-instanton app, but the app pod says it's running and the server actually started.

quicksec-jakarta-0 1/1 Running 0 6m42s

........
JVMJ9GC063E Unable to open file '/logs/verbosegc.log' for writing
.............
[6/26/24, 21:12:31:378 UTC] 0000002e org.infinispan.HOTROD                                        I ISPN004021: Infinispan version: Infinispan 'Flying Saucer' 14.0.29.Final
TRAS0036E: The system could not create file /logs/messages.log because of the following exception: java.security.PrivilegedActionException: java.io.IOException: Permission denied
...............
[6/26/24, 21:12:35:595 UTC] 0000002b com.ibm.ws.kernel.feature.internal.FeatureManager            I CWWKF0008I: Feature update completed in 7.904 seconds.
[6/26/24, 21:12:35:595 UTC] 0000002b com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 9.782 seconds.

When I try to copy the app logs, it also throws permission errors

% oc cp quicksec-jakarta-0:/logs quicksec-jakarta-0/      
tar: Removing leading `/' from member names
tar: /logs/verbosegc.log.001: Cannot open: Permission denied
tar: /logs/messages.log: Cannot open: Permission denied
tar: /logs/trace.log: Cannot open: Permission denied
tar: /logs/verbosegc.log: Cannot open: Permission denied
tar: /logs/ffdc: Cannot open: Permission denied
tar: /logs/trace_24.05.30_21.50.28.0.log: Cannot open: Permission denied
tar: /logs/messages_24.06.25_20.41.29.0.log: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors

I posted the full log in slack here https://ibm-cloud.slack.com/archives/CMM0D6JHF/p1719437551545339

@leochr
Copy link
Member

leochr commented Jul 22, 2024

@tam512 Thanks for the info. The custom SCC cap-cr-scc has more permissions than the default restricted-v2, so I don't see how SCC could be the cause of permission issue. This may be another storage-related issue, same as the one seen with trace/dump operations. I suggest to try with a different storage option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants