Handle Reader thread termination gracefully #476

Merged: 5 commits, Sep 7, 2023

@khushbr khushbr commented Aug 15, 2023

Is your feature request related to a problem? Please provide an existing Issue # , or describe.
#468

How can one reproduce the bug?
Update docker/docker-compose.cluster.yml to:

version: '2.1'
services:
  opensearch1:
    environment:
      - node.name=opensearch1
      - discovery.seed_hosts=opensearch2
      - cluster.initial_cluster_manager_nodes=opensearch1
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    tmpfs:
      - /tmp
  opensearch2:
    environment:
      - node.name=opensearch2
      - discovery.seed_hosts=opensearch1
      - cluster.initial_cluster_manager_nodes=opensearch1
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    tmpfs:
      - /tmp

Describe the solution you are proposing
The RCA Agent process creates 4 threads during bootstrap. The Reader thread is the most critical because it is responsible for reading, processing, and cleaning up the upstream metrics generated by the Performance Analyzer (PA) plugin. If the Reader thread crashes, metrics accumulate and fill up the disk, while the other components (RCA Graph, Webserver, and gRPC Server) perform no work, because the metrics are the leaf nodes of all the processing.

The initial design, thus, was to keep creating a new Reader thread in an infinite loop. This, however, is inefficient for non-retryable (permanent) failures. This change adds:

  1. A maximum number of retry attempts to bring up the Reader thread.
  2. If the Reader thread doesn't come up, failure handling that disables the PA plugin and exits the process via a Runtime exit, which runs the shutdown hook for a graceful shutdown.
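
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class `ReaderRetrySketch`, the method `runWithBoundedRetries`, and its parameters are hypothetical names, and the failure handler only logs where the actual change disables PA and calls `System.exit(1)`.

```java
/** Hypothetical sketch of bounded Reader restarts; not the code merged in this PR. */
class ReaderRetrySketch {

    /**
     * Runs readerTask until it returns normally or maxAttempts is exhausted.
     * Returns the attempt number that succeeded, or -1 after giving up and
     * invoking onExhausted.
     */
    static int runWithBoundedRetries(
            Runnable readerTask, int maxAttempts, long pauseMillis, Runnable onExhausted) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                readerTask.run();
                return attempt; // the reader task finished without crashing
            } catch (RuntimeException e) {
                System.err.println(
                        "Reader attempt " + attempt + " of " + maxAttempts + " failed: " + e);
            }
            if (attempt < maxAttempts && pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis); // fixed pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the thread is interrupted
                }
            }
        }
        onExhausted.run();
        return -1;
    }

    public static void main(String[] args) {
        // A shutdown hook registered up front runs when the JVM terminates,
        // including termination triggered by System.exit.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.err.println("Shutdown hook: cleaning up")));

        Runnable alwaysFailingReader = () -> {
            throw new IllegalStateException("simulated permanent failure");
        };

        runWithBoundedRetries(alwaysFailingReader, 3, 100L, () ->
                // The PR disables PA and calls System.exit(1) at this point;
                // the sketch only logs so that it exits normally.
                System.err.println("Reader did not come up; would disable PA and exit"));
    }
}
```

Passing the failure handler as a `Runnable` keeps the retry loop free of process-level side effects, which makes the give-up path easy to exercise in a unit test.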

Describe alternatives you've considered
The alternative to terminating the RCA process is to keep the other threads and the process running after the Reader thread has crashed. However, with no metrics flowing through the analysis graph, the Controller thread and the servers (web and gRPC) do no work.

Testing

1. PerformanceAnalyzer.log with failure simulation

2023-09-06 22:12:50.085 [PA:Reader] [pa-reader] INFO  org.opensearch.performanceanalyzer.PerformanceAnalyzerApp - Exhausted 12 attempts - unable to start Reader Thread successfully; disable PA
2023-09-06 22:12:50.137 [PA:Reader] [pa-reader] ERROR org.opensearch.performanceanalyzer.LocalhostConnectionUtil - PA Disable Request failed: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?]
	at java.net.Socket.connect(Socket.java:591) ~[?:?]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:?]
	at sun.net.www.http.HttpClient.New(HttpClient.java:341) ~[?:?]
	at sun.net.www.http.HttpClient.New(HttpClient.java:362) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1242) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1009) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1331) ~[?:?]
	at org.opensearch.performanceanalyzer.LocalhostConnectionUtil.disablePA(LocalhostConnectionUtil.java:32) ~[performance-analyzer-rca-3.0.0.0-SNAPSHOT.jar:?]
	at org.opensearch.performanceanalyzer.PerformanceAnalyzerApp.handleReaderThreadFailed(PerformanceAnalyzerApp.java:287) ~[performance-analyzer-rca-3.0.0.0-SNAPSHOT.jar:?]
	at org.opensearch.performanceanalyzer.PerformanceAnalyzerApp.lambda$startReaderThread$2(PerformanceAnalyzerApp.java:268) ~[performance-analyzer-rca-3.0.0.0-SNAPSHOT.jar:?]
	at org.opensearch.performanceanalyzer.threads.ThreadProvider.lambda$createThreadForRunnable$0(ThreadProvider.java:47) ~[performance-analyzer-rca-3.0.0.0-SNAPSHOT.jar:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
2023-09-06 22:12:50.452 [PA:Reader] [grpc-server] INFO  org.opensearch.performanceanalyzer.net.NetServer - gRPC server started successfully!
2023-09-06 22:14:36.841 [PA:Reader] [pa-reader] INFO  org.opensearch.performanceanalyzer.LocalhostConnectionUtil - PA Disable Response: 200 OK
2023-09-06 22:14:37.007 [PA:Reader] [pa-reader] INFO  org.opensearch.performanceanalyzer.PerformanceAnalyzerApp - PA disable succeeded.
2023-09-06 22:14:37.014 [PA:Reader] [pa-reader] ERROR org.opensearch.performanceanalyzer.PerformanceAnalyzerApp - Reader thread not coming up successfully - Shutting down RCA Runtime
2023-09-06 22:14:37.035 [PA:Reader] [Thread-3] INFO  org.opensearch.performanceanalyzer.PerformanceAnalyzerApp - Trying to shutdown performance analyzer gracefully

2. supervisord.log after RCA process termination

2023-09-06 22:12:45,412 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2023-09-06 22:12:45,413 WARN No file matches via include "/etc/supervisor/conf.d/*.conf"
2023-09-06 22:12:45,422 INFO RPC interface 'supervisor' initialized
2023-09-06 22:12:45,422 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-09-06 22:12:45,423 INFO daemonizing the supervisord process
2023-09-06 22:12:45,423 INFO supervisord started with pid 14
2023-09-06 22:12:46,426 INFO spawned: 'stop_supervisord' with pid 134
2023-09-06 22:12:46,428 INFO spawned: 'performance_analyzer' with pid 135
2023-09-06 22:12:47,438 INFO success: stop_supervisord entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-06 22:12:47,438 INFO success: performance_analyzer entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-06 22:14:37,662 INFO exited: performance_analyzer (exit status 1; expected)
2023-09-06 22:14:37,715 WARN received SIGQUIT indicating exit request
2023-09-06 22:14:37,716 INFO waiting for stop_supervisord to die
2023-09-06 22:14:38,726 WARN stopped: stop_supervisord (terminated by SIGTERM)

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Khushboo Rajput <khushbr@amazon.com>

codecov bot commented Aug 15, 2023

Codecov Report

Merging #476 (4346d9a) into main (572e7bc) will decrease coverage by 0.34%.
The diff coverage is 0.00%.

❗ Current head 4346d9a differs from pull request most recent head f50e3ee. Consider uploading reports for the commit f50e3ee to get more accurate results

@@             Coverage Diff              @@
##               main     #476      +/-   ##
============================================
- Coverage     74.78%   74.44%   -0.34%     
+ Complexity     2668     2664       -4     
============================================
  Files           316      317       +1     
  Lines         16243    16301      +58     
  Branches       1272     1277       +5     
============================================
- Hits          12147    12136      -11     
- Misses         3581     3651      +70     
+ Partials        515      514       -1     
Files Changed | Coverage Δ
...h/performanceanalyzer/LocalhostConnectionUtil.java | 0.00% <0.00%> (ø)
...ch/performanceanalyzer/PerformanceAnalyzerApp.java | 34.65% <0.00%> (-15.35%) ⬇️
...rg/opensearch/performanceanalyzer/rca/Version.java | 0.00% <ø> (ø)

... and 4 files with indirect coverage changes

StatsCollector.instance().logException(StatExceptionCode.READER_ERROR_RCA_AGENT_STOPPED);

// Terminate Java Runtime, executes {@link #shutDownGracefully(ClientServers clientServers)}
System.exit(1);
Contributor

@sgup432 sgup432 Aug 15, 2023

In case users want (or try) to re-enable PA/RCA dynamically via the CLI, I guess it wouldn't work?
How do we handle such scenarios?

Does it make sense to just terminate the Reader thread? Killing the RCA process and disabling the PA plugin (via port 9200) seems too intrusive.

Collaborator Author

@khushbr khushbr Aug 22, 2023

I disagree: killing only the PA thread while keeping the process idle wastes system resources. As we know, the RCA process depends on the Reader subcomponent to provide the data, which then flows through the analysis graph and is made available to nodes and users via the gRPC and web servers, respectively.

> In case users want (or try) to re-enable PA/RCA dynamically via the CLI, I guess it wouldn't work?
> How do we handle such scenarios?

PA can be enabled with a REST API call, while starting RCA requires bringing up the RCA Agent via the performance-analyzer-agent tool. This updated behavior will be added to the documentation.
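
For illustration, such a REST call could look roughly like the sketch below. It is an assumption-laden example rather than the `LocalhostConnectionUtil` code from this PR: the class `PaToggleSketch` and method `setPaEnabled` are hypothetical, the endpoint path `/_plugins/_performanceanalyzer/cluster/config` and the `{"enabled": ...}` body follow the OpenSearch Performance Analyzer documentation, and TLS and authentication are not handled.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of toggling PA over REST; not the PR's LocalhostConnectionUtil. */
class PaToggleSketch {

    /** POSTs {"enabled": <flag>} to the given config URL and returns the HTTP status code. */
    static int setPaEnabled(URL configUrl, boolean enabled) throws IOException {
        byte[] body = ("{\"enabled\": " + enabled + "}").getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) configUrl.openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true); // required before writing a request body
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PaToggleSketch <config-url> <true|false>");
            return;
        }
        // Example URL (assumed default port and documented path):
        // http://localhost:9200/_plugins/_performanceanalyzer/cluster/config
        int status = setPaEnabled(new URL(args[0]), Boolean.parseBoolean(args[1]));
        System.out.println("PA config response status: " + status);
    }
}
```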

Let me know your thoughts.

@@ -51,6 +53,7 @@ public class PerformanceAnalyzerApp {

private static final Logger LOG = LogManager.getLogger(PerformanceAnalyzerApp.class);

public static final int READER_RESTART_MAX_ATTEMPTS = 3;
Member

We should probably do some test runs in the Docker environment to determine the right value for READER_RESTART_MAX_ATTEMPTS. We could also consider exponential backoff here in case the environment setup takes time.

Collaborator Author

I will increase the number of attempts here to 12 (1 min).

Exponential backoff makes sense for API calls, but here a thread is crashing while attempting to read data from disk, so a backoff retry is unlikely to help.
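
As a rough illustration of the two retry budgets, the sketch below compares the total wait of a fixed interval with that of a doubling backoff. The 5-second interval is an assumption inferred from 12 attempts covering about one minute; it is not stated in this PR, and the class `RetryBudgetSketch` and its methods are hypothetical.

```java
/** Hypothetical arithmetic sketch; the 5 s interval is an assumption, not a value from the PR. */
class RetryBudgetSketch {

    /** Total wait when every attempt is followed by the same pause. */
    static long fixedIntervalTotalMillis(int attempts, long intervalMillis) {
        return attempts * intervalMillis;
    }

    /** Total wait when the pause starts at baseMillis and doubles after each attempt. */
    static long exponentialTotalMillis(int attempts, long baseMillis) {
        long total = 0;
        long delay = baseMillis;
        for (int i = 0; i < attempts; i++) {
            total += delay;
            delay *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("fixed:       " + fixedIntervalTotalMillis(12, 5_000L) + " ms");
        System.out.println("exponential: " + exponentialTotalMillis(12, 5_000L) + " ms");
    }
}
```

Under that assumption, the fixed interval gives up after 60,000 ms (one minute), while an uncapped doubling backoff over the same 12 attempts would wait 20,475,000 ms (about 5.7 hours) before giving up.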

Signed-off-by: Khushboo Rajput <khushbr@amazon.com>
Signed-off-by: Khushboo Rajput <khushbr@amazon.com>
ansjcy previously approved these changes Sep 6, 2023
Member

@ansjcy ansjcy left a comment

LGTM! Thanks @khushbr for making the changes!

…nd PROCESS_STATE_FATAL handling

Signed-off-by: Khushboo Rajput <khushbr@amazon.com>
@khushbr khushbr merged commit 08ce04e into opensearch-project:main Sep 7, 2023
7 of 9 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-476-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 08ce04e21f20628d40383c355d73f2865d7b216c
# Push it to GitHub
git push --set-upstream origin backport/backport-476-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-476-to-2.x.
