[BUG] org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently flaky #5425

Closed
dblock opened this issue Dec 1, 2022 · 7 comments
Labels: bug, flaky-test, Indexing

Comments

@dblock
Member

dblock commented Dec 1, 2022

#5418 (comment)

https://build.ci.opensearch.org/job/gradle-check/7464/

Not reproducible.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.action.admin.indices.create.CreateIndexIT" -Dtests.seed=EA9B1803756CB3DA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Etc/UTC -Druntime.java=19
dblock added the bug, untriaged, and flaky-test labels on Dec 1, 2022
@dblock
Member Author

dblock commented Dec 1, 2022

#2105 may be related

@soosinha
Member

[2023-12-14T03:22:51,283][WARN ][o.o.i.c.IndicesClusterStateService] [node_t2] [test][1] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [test][1]: Recovery failed from {node_t1}{unJbKf_RSSagBnE440OOrw}{a775_BliR_WvTYflMVyh7g}{127.0.0.1}{127.0.0.1:53213}{dimr}{shard_indexing_pressure_enabled=true} into {node_t2}{BzqQ9aUKSOWk7GHnGn8-Xg}{mKanlBDeSOavIkjr4PtdOQ}{127.0.0.1}{127.0.0.1:53215}{dimr}{shard_indexing_pressure_enabled=true} ([test][1]: Recovery failed from {node_t1}{unJbKf_RSSagBnE440OOrw}{a775_BliR_WvTYflMVyh7g}{127.0.0.1}{127.0.0.1:53213}{dimr}{shard_indexing_pressure_enabled=true} into {node_t2}{BzqQ9aUKSOWk7GHnGn8-Xg}{mKanlBDeSOavIkjr4PtdOQ}{127.0.0.1}{127.0.0.1:53215}{dimr}{shard_indexing_pressure_enabled=true})
	at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:136) [classes/:?]
	at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) [classes/:?]
	at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) [classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:751) [classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:681) [classes/:?]
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) [classes/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1497) [classes/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleException$5(InboundHandler.java:447) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [test][1]: Recovery failed from {node_t1}{unJbKf_RSSagBnE440OOrw}{a775_BliR_WvTYflMVyh7g}{127.0.0.1}{127.0.0.1:53213}{dimr}{shard_indexing_pressure_enabled=true} into {node_t2}{BzqQ9aUKSOWk7GHnGn8-Xg}{mKanlBDeSOavIkjr4PtdOQ}{127.0.0.1}{127.0.0.1:53215}{dimr}{shard_indexing_pressure_enabled=true}
	... 9 more
Caused by: org.opensearch.transport.RemoteTransportException: [node_t1][127.0.0.1:53213][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
	at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$22(RecoverySourceHandler.java:627) ~[classes/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[classes/:?]
	at org.opensearch.core.action.ActionListener$4.onFailure(ActionListener.java:192) ~[classes/:?]
	at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[classes/:?]
	at org.opensearch.action.support.RetryableAction.cancel(RetryableAction.java:127) ~[classes/:?]
	at org.opensearch.indices.recovery.RetryableTransportClient.lambda$cancel$1(RetryableTransportClient.java:120) ~[classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) ~[classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: org.opensearch.common.util.CancellableThreads$ExecutionCancelledException: retryable action was cancelled
	at org.opensearch.indices.recovery.RetryableTransportClient.cancel(RetryableTransportClient.java:116) ~[classes/:?]
	at org.opensearch.indices.recovery.RemoteRecoveryTargetHandler.cancel(RemoteRecoveryTargetHandler.java:267) ~[classes/:?]
	at org.opensearch.indices.recovery.RecoverySourceHandler.cancel(RecoverySourceHandler.java:890) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoverySourceService$OngoingRecoveries.cancel(PeerRecoverySourceService.java:298) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoverySourceService.beforeIndexShardClosed(PeerRecoverySourceService.java:146) ~[classes/:?]
	at org.opensearch.index.CompositeIndexEventListener.beforeIndexShardClosed(CompositeIndexEventListener.java:121) ~[classes/:?]
	at org.opensearch.index.IndexService.closeShard(IndexService.java:634) ~[classes/:?]
	at org.opensearch.index.IndexService.removeShard(IndexService.java:618) ~[classes/:?]
	at org.opensearch.index.IndexService.close(IndexService.java:389) ~[classes/:?]
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1044) ~[classes/:?]
	at org.opensearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:352) ~[classes/:?]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:282) ~[classes/:?]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608) ~[classes/:?]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595) ~[classes/:?]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563) ~[classes/:?]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) ~[classes/:?]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) ~[classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) ~[classes/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) ~[classes/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) ~[classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
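
Reading the trace from the innermost cause outward: applying a cluster state that deletes the index (IndicesClusterStateService.deleteIndices) closes the shard on node_t1, beforeIndexShardClosed cancels the ongoing peer recovery, and node_t2 then reports the recovery as failed. In a test that creates and deletes an index concurrently, this warning may be a side effect of the deletion rather than the cause of the flakiness. Below is a minimal Java sketch of that cancellation pattern. It uses plain CompletableFuture, and every class and variable name in it is a hypothetical stand-in, not an OpenSearch API.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in; none of these names come from OpenSearch.
class RecoveryCancelSketch {

    /** Simulates index deletion cancelling a recovery step that is still in flight. */
    static String simulate() {
        // Stands in for the pending "prepare target for translog" step on the source node.
        CompletableFuture<Void> prepareTranslog = new CompletableFuture<>();
        AtomicReference<String> outcome = new AtomicReference<>("pending");

        // Stands in for the recovery listener on the target node.
        prepareTranslog.whenComplete((ignored, error) ->
            outcome.set(error == null
                ? "recovered"
                : "failed: " + error.getClass().getSimpleName()));

        // Stands in for the delete path: closing the shard cancels the pending
        // step before it can complete normally.
        prepareTranslog.completeExceptionally(
            new CancellationException("retryable action was cancelled"));

        return outcome.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

The listener sees the step as failed, not recovered, even though nothing went wrong with the recovery itself. That is the shape of the RecoveryFailedException chain above.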

@aasom143
Contributor

Ran the test 500 times; every iteration passed.

[ec2-user@ip-172-31-72-21 OpenSearch]$ ./gradlew ':server:internalClusterTest' --tests "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently" -Dtests.seed=EA9B1803756CB3DA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Etc/UTC -Dtests.timeoutSuite=108000000! -Dtests.iters=500
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Linux 6.1.84-99.169.amzn2023.x86_64 (amd64)
  JDK Version           : 21 (Amazon Corretto JRE)
  JAVA_HOME             : /usr/lib/jvm/java-21-amazon-corretto.x86_64
  Random Testing Seed   : EA9B1803756CB3DA
  In FIPS 140 mode      : false
=======================================

> Task :server:internalClusterTest
WARNING: Using incubator modules: jdk.incubator.vector
Apr 24, 2024 6:13:28 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/home/ec2-user/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/home/ec2-user/.gradle/wrapper/dists/gradle-8.7-all/aan3ydargesu18aqyqjwhr3pc/gradle-8.7/lib/plugins/gradle-testing-base-8.7.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 14m 47s
57 actionable tasks: 1 executed, 56 up-to-date
WARNING: The following functionality has been deprecated and will be removed in the next major release of the Develocity Gradle plugin. For assistance with migration, see https://gradle.com/help/gradle-plugin-develocity-migration.
- The deprecated "gradle.enterprise.testretry.enabled" system property has been replaced by "develocity.testretry.enabled"
- The "com.gradle.enterprise" plugin has been replaced by "com.gradle.develocity"

@aasom143
Contributor

Ran the whole CreateIndexIT suite 500 times; all tests passed.

[ec2-user@ip-172-31-72-21 OpenSearch]$ ./gradlew ':server:internalClusterTest' --tests "org.opensearch.action.admin.indices.create.CreateIndexIT" -Dtests.seed=EA9B1803756CB3DA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Etc/UTC -Dtests.timeoutSuite=108000000! -Dtests.iters=500
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Linux 6.1.84-99.169.amzn2023.x86_64 (amd64)
  JDK Version           : 21 (Amazon Corretto JRE)
  JAVA_HOME             : /usr/lib/jvm/java-21-amazon-corretto.x86_64
  Random Testing Seed   : EA9B1803756CB3DA
  In FIPS 140 mode      : false
=======================================

> Task :server:internalClusterTest
WARNING: Using incubator modules: jdk.incubator.vector
Apr 24, 2024 6:30:08 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/home/ec2-user/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/home/ec2-user/.gradle/wrapper/dists/gradle-8.7-all/aan3ydargesu18aqyqjwhr3pc/gradle-8.7/lib/plugins/gradle-testing-base-8.7.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 3h 7m 57s
57 actionable tasks: 1 executed, 56 up-to-date
WARNING: The following functionality has been deprecated and will be removed in the next major release of the Develocity Gradle plugin. For assistance with migration, see https://gradle.com/help/gradle-plugin-develocity-migration.
- The deprecated "gradle.enterprise.testretry.enabled" system property has been replaced by "develocity.testretry.enabled"
- The "com.gradle.enterprise" plugin has been replaced by "com.gradle.develocity"

rwali-aws assigned aasom143 and unassigned amkhar on Apr 25, 2024
@aasom143
Contributor

aasom143 commented May 16, 2024

Attaching one of the thread dumps taken while the IT was stuck after 10 tests. The test hangs during indexing; in the code, it is stuck in the shardBulkAction call.
testCreateAndDeleteIndexConcurrently.txt
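
For anyone trying to capture a similar dump from inside a hung JVM, here is a minimal sketch using the standard java.lang.management API. This is not necessarily how the attached file was produced (running jstack against the test JVM is another common route), and the class name ThreadDumpSketch is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the OpenSearch test framework.
class ThreadDumpSketch {

    /** Returns the name, state, and stack frames of every live thread in this JVM. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // true, true: also collect locked monitors and ownable synchronizers.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

In a dump like this, a thread that stays in WAITING or BLOCKED across several snapshots, with bulk indexing frames on its stack, would be the place to start looking.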

As discussed offline with @shwetathareja, she agrees that this should move to the Indexing team.
Moving the test to the Indexing bucket.

@rwali-aws

Ack, @aasom143. Switching this to the Indexing label. Thanks!

rwali-aws added the Indexing label and removed the Cluster Manager label on May 17, 2024
soosinha self-assigned this and unassigned aasom143 on May 31, 2024
@reta
Collaborator

reta commented Jun 19, 2024

Closing in favour of #14312

reta closed this as completed on Jun 19, 2024