SIGTERM doesn’t gracefully shutdown in 2.12+ #13851

ben-gineer · 2024-05-28T06:35:37Z

Describe the bug

Since updating to 2.12, sending the opensearch process a SIGTERM does not gracefully shut the service down, leaving node.lock and write.lock files in the data folder. This causes issues with our docker builds. See:

https://forum.opensearch.org/t/cannot-create-pre-baked-docker-image-of-opensearch-2-12/19574

Is there a way to shut down cleanly without manually cleaning up these lock files?

Related component

Cluster Manager

To Reproduce

Start OpenSearch in a Dockerfile
Perform some index population (optional)
Send the process a SIGTERM signal
The process leaves various lock files behind

Expected behavior

SIGTERM should clean up gracefully.


ben-gineer · 2024-05-28T19:26:59Z

This is also true when using test containers to populate a container and shut it down. It leaves the lock files behind.

dblock · 2024-05-28T21:26:46Z

#1304 is related

andrross · 2024-05-28T22:04:15Z

@ben-gineer Is there anything interesting in the log files from the service when a SIGTERM signal is received? Is the process actually exiting and leaving the lock files behind, or does the process stay running and something has to kill it?
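
One way to answer both questions is sketched below. The helper names `is_running` and `list_lock_files` are hypothetical (they are not part of OpenSearch), and the sketch assumes a POSIX shell with `find` available. It only reports what is on disk and whether the PID still exists; it does not say why a lock file was left behind.

```shell
#!/bin/sh
# Diagnostic sketch; helper names are hypothetical and not part of OpenSearch.

# Succeeds while a process with the given PID exists.
# Signal 0 performs the existence/permission check without delivering a signal.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Prints every node.lock or write.lock file found under the given data directory.
list_lock_files() {
    find "$1" -type f \( -name node.lock -o -name write.lock \)
}
```

After sending the signal, `is_running "$(cat pid_file)"` would show whether the process is still alive, and `list_lock_files /usr/share/opensearch/data` would show which lock files remain. The PID file name and data path here are taken from the test case and stack trace posted in this thread; adjust them for other setups.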

peternied · 2024-05-29T15:19:51Z

[Triage - attendees 1 2 3 4 5 6]
@ben-gineer Thanks for creating this issue. While it looks like #1304 is related, this scenario should be supported independently of that issue. We'd welcome a pull request or a test case for the scenario.

ben-gineer · 2024-05-29T17:02:25Z

Here's a test case:

Given this Dockerfile:

FROM opensearchproject/opensearch:2.13.0

RUN /usr/share/opensearch/bin/opensearch-plugin remove --purge opensearch-security

USER root
RUN yum install -y curl-minimal procps

USER opensearch
RUN echo "Starting OpenSearch..." && \
    opensearch -p pid_file -E discovery.type=single-node -E http.port=9200 -d > opensearch.log 2>&1 && \
    while [ "$(curl --write-out %{http_code} --silent --output /dev/null localhost:9200)" -ne "200" ]; do sleep 1 && \
    echo "Waiting for OpenSearch to be up and running..."; done
    # Populate OpenSearch index here...
    # kill -15 `cat pid_file`

EXPOSE 9200

I build it as follows:

docker build -t opensearch:test - < Dockerfile

I then run it as follows:

docker run -it -e "discovery.type=single-node" opensearch:test

This will give this error:

[2024-05-29T16:51:53,026][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [3e996e28a47e] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-05-29T16:51:52.98557063Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-05-29T16:51:33.402366753Z))
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.13.0.jar:2.13.0]

This error does not occur if I change the OpenSearch version to 2.11.0 in the Dockerfile.

Perhaps there is a better way to check that the OpenSearch instance is up, using the health endpoint? However, even if I kill the process with signal 15 (SIGTERM), Lucene still leaves those lock files behind without cleaning up properly.
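
As a rough sketch of both ideas, the helpers below poll the cluster health API instead of the root URL, and wait for the process to exit after SIGTERM, since `kill` returns as soon as the signal is sent rather than when shutdown completes. The function names, the timeouts, and the choice of `yellow` as the readiness threshold are assumptions for illustration. Nothing in this thread confirms that waiting for exit removes the lock files on 2.12+.

```shell
#!/bin/sh
# Sketch only: helper names, timeouts, and the "yellow" threshold are assumptions.

# Poll the cluster health API until the cluster reports at least yellow status.
# $1 = base URL (for example http://localhost:9200), $2 = maximum attempts
wait_for_health() {
    url=$1
    attempts=${2:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP error responses;
        # a refused connection also yields a non-zero exit status.
        if curl --silent --fail --output /dev/null \
            "$url/_cluster/health?wait_for_status=yellow&timeout=1s"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Send SIGTERM, then block until the process is gone or the timeout expires.
# $1 = PID, $2 = timeout in seconds
stop_and_wait() {
    pid=$1
    timeout=${2:-60}
    # If the signal cannot be delivered, treat the process as already stopped.
    kill -15 "$pid" 2>/dev/null || return 0
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the Dockerfile above, the `RUN` step could call `wait_for_health http://localhost:9200 120`, populate the index, and then call `stop_and_wait "$(cat pid_file)" 60` so that the layer is not committed while the process is still shutting down.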

peterzhuamazon · 2024-07-02T22:46:48Z

Resolved in opensearch-project/opensearch-build#4694.

ben-gineer added bug Something isn't working untriaged labels May 28, 2024

github-actions bot added the Cluster Manager label May 28, 2024

github-project-automation bot added this to Cluster Manager Project Board May 28, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board May 28, 2024

peternied removed the untriaged label May 29, 2024

liamwhite mentioned this issue Jun 27, 2024

[BUG] Container SIGTERM does nothing on Docker Hub images #14577

Closed

peterzhuamazon closed this as completed Jul 2, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Cluster Manager Project Board Jul 2, 2024
