SIGTERM doesn’t gracefully shutdown in 2.12+ #13851

Closed
ben-gineer opened this issue May 28, 2024 · 6 comments
Labels
bug Something isn't working Cluster Manager

Comments

@ben-gineer

Describe the bug

Since updating to 2.12, sending the opensearch process a SIGTERM does not gracefully shut the service down, leaving node.lock and write.lock files in the data folder. This causes issues with our docker builds. See:

https://forum.opensearch.org/t/cannot-create-pre-baked-docker-image-of-opensearch-2-12/19574

Is there a way to shut down cleanly without manually cleaning up these lock files?

Related component

Cluster Manager

To Reproduce

  1. Start OpenSearch in a Dockerfile
  2. Perform some index population (optional)
  3. Send the process a SIGTERM kill signal
  4. The process leaves various lock files behind

Expected behavior

SIGTERM should clean up gracefully.


@ben-gineer
Author

This is also true when using test containers to populate a container and shut it down. It leaves the lock files behind.

@dblock
Member

dblock commented May 28, 2024

#1304 is related

@andrross
Member

@ben-gineer Is there anything interesting in the log files from the service when a SIGTERM signal is received? Is the process actually exiting and leaving the lock files behind, or does the process stay running and something has to kill it?

@peternied
Member

[Triage - attendees 1 2 3 4 5 6]
@ben-gineer Thanks for creating this issue. While it looks like #1304 is related, this should be supported independently of that issue. We'd welcome a pull request or a test case for the scenario.

@ben-gineer
Author

Here's a test case:

  • Given this Dockerfile:
FROM opensearchproject/opensearch:2.13.0

RUN /usr/share/opensearch/bin/opensearch-plugin remove --purge opensearch-security

USER root
RUN yum install -y curl-minimal procps

USER opensearch
RUN echo "Starting OpenSearch..." && \
    opensearch -p pid_file -E discovery.type=single-node -E http.port=9200 -d > opensearch.log 2>&1 && \
    while [ "$(curl --write-out %{http_code} --silent --output /dev/null localhost:9200)" -ne "200" ]; do sleep 1 && \
    echo "Waiting for OpenSearch to be up and running..."; done
    # Populate OpenSearch index here...
    # kill -15 `cat pid_file`

EXPOSE 9200
  • I build it as follows:
docker build -t opensearch:test - < Dockerfile    
  • I then run it as follows:
docker run -it -e "discovery.type=single-node" opensearch:test 
  • This will give this error:
[2024-05-29T16:51:53,026][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [3e996e28a47e] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2024-05-29T16:51:52.98557063Z, (lock=NativeFSLock(path=/usr/share/opensearch/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2024-05-29T16:51:33.402366753Z))
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.13.0.jar:2.13.0]
  • This error does not occur if I change the OpenSearch version to 2.11.0 in the Dockerfile

Perhaps there is a better way to check that the OpenSearch instance is up, such as using the health endpoint? However, even if I kill the process with signal 15 (SIGTERM), Lucene still leaves those lock files behind without cleaning up properly.
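
For reference, here is a minimal sketch of what polling the health endpoint and then waiting for the process to actually exit after SIGTERM could look like. The helper names `wait_for_health` and `stop_and_wait` are hypothetical, and the retry counts are arbitrary. It only checks that `_cluster/health` answers successfully and that the PID disappears; it is not claimed to make the lock files go away.

```shell
#!/bin/sh
# Sketch only: wait_for_health and stop_and_wait are hypothetical helper names.

# Poll the cluster health endpoint once per second until it answers with a
# success status, or give up after the given number of attempts (default 60).
wait_for_health() {
    url="$1"
    tries="${2:-60}"
    while [ "$tries" -gt 0 ]; do
        if curl --silent --fail --output /dev/null "$url/_cluster/health"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Send SIGTERM (signal 15), then block until the process is gone or the
# given number of seconds has passed (default 60).
stop_and_wait() {
    pid="$1"
    tries="${2:-60}"
    # If the signal cannot be delivered, treat the process as already stopped.
    kill -15 "$pid" 2>/dev/null || return 0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$tries" -gt 0 ] || return 1
        tries=$((tries - 1))
        sleep 1
    done
    return 0
}
```

In the Dockerfile above, the `RUN` step could then call `wait_for_health http://localhost:9200` in place of the `curl` loop and finish with `stop_and_wait "$(cat pid_file)"`, so the layer is only committed once the process has exited.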

@peterzhuamazon
Member

Resolved in opensearch-project/opensearch-build#4694.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Cluster Manager Project Board Jul 2, 2024