Improve jenkins performance #353

Merged 1 commit into opensearch-project:main on Oct 2, 2023

Collaborator

@prudhvigodithi prudhvigodithi commented Oct 2, 2023

Description

Improve Jenkins performance. This should address the Jenkins 504 timeout errors caused by OutOfMemoryError.

Issues Resolved

Part of #346

Analysis-1

Started debugging by creating a 360 Java heap dump using the open source yc tool.

Extracted yc-2023-09-29T04-32-09.zip and used the jhat CLI to debug the heap dump, noting the error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.

Also analyzed all the dump files in the zip and found multiple threads in the BLOCKED state, each honoring the -Xss4m setting.

Apart from the threads in the BLOCKED state, noted multiple I/O errors with threads in an interrupted state.

Took the following measures to fix this:

  • Removed the -Xss setting to avoid an overly large thread stack size.
  • Increased the Docker memory and application Xmx limits (applied manually to test and observe whether this would resolve the frequent OutOfMemoryError).
  • Deleted all agent nodes to clean up the I/O errors and the threads in an interrupted state.
  • Restarted the Docker (Jenkins) container.
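
To illustrate why removing -Xss4m can matter, here is a rough back-of-the-envelope sketch in Java. It assumes a hypothetical count of 2000 live threads (the PR does not state the real number) and compares the stack space reserved at 4 MiB per thread with the 1 MiB default that 64-bit Linux JVMs typically use. Thread stacks are native memory reserved outside the Java heap and are not necessarily fully committed, so these figures are an upper bound rather than a measurement.

```java
class StackReservation {
    static final long MIB = 1024L * 1024L;

    // Upper bound on the memory reserved for thread stacks:
    // one stack of stackSizeBytes per live thread.
    static long reservedStackBytes(int threadCount, long stackSizeBytes) {
        return threadCount * stackSizeBytes;
    }

    public static void main(String[] args) {
        int threads = 2000; // hypothetical thread count, not taken from the PR

        long withXss4m = reservedStackBytes(threads, 4 * MIB);
        long withDefault = reservedStackBytes(threads, 1 * MIB); // assumed 1 MiB default

        System.out.println("Reserved with -Xss4m:  " + (withXss4m / MIB) + " MiB");
        System.out.println("Reserved with default: " + (withDefault / MIB) + " MiB");
    }
}
```

The actual saving depends on how many threads the controller really runs, which a thread dump can show.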

Analysis-2

Following the Jenkins GC performance tuning guidance, added the following to the Jenkins JVM options.

- JENKINS_JAVA_OPTS=-Xms33g -Xmx65g -Dhudson.model.ParametersAction.keepUndefinedParameters=true -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1

-XX:+UseG1GC: Switches to the Garbage First (G1) garbage collector, which is designed for low pause times. It divides the heap into regions and collects specific regions rather than the whole heap at once.
-XX:+ExplicitGCInvokesConcurrent: Makes explicit garbage collection calls (System.gc()) start a concurrent collection cycle that runs alongside the application threads instead of a full stop-the-world collection.
-XX:+ParallelRefProcEnabled: Enables parallel reference processing, which uses multiple cores more effectively.
-XX:+UseStringDeduplication: Enables string deduplication in G1GC. Duplicate strings in the heap share the same underlying character data, which saves memory.
-XX:+UnlockExperimentalVMOptions: Allows the use of experimental JVM options (required here for -XX:G1NewSizePercent).
-XX:G1NewSizePercent=20: Sets the minimum size of the young generation to 20% of the heap.
-XX:+UnlockDiagnosticVMOptions: Unlocks diagnostic VM options, allowing the use of additional options for monitoring and troubleshooting the JVM.
-XX:G1SummarizeRSetStatsPeriod=1: Sets the period, measured in number of GCs, at which G1GC reports remembered set (RSet) summary statistics.
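
One way to confirm that flags like these actually reached the running JVM is to read its input arguments through the standard RuntimeMXBean. The Java sketch below is illustrative only: the class name JvmFlagCheck and the helper missingFlags are hypothetical, and the expected list is just a subset of the options above.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JvmFlagCheck {
    // Returns the expected flags that do not appear in the actual JVM arguments.
    static List<String> missingFlags(List<String> actualArgs, List<String> expectedFlags) {
        List<String> missing = new ArrayList<>();
        for (String flag : expectedFlags) {
            if (!actualArgs.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (e.g. via JENKINS_JAVA_OPTS).
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();

        // Subset of the flags listed above; extend as needed.
        List<String> expected = Arrays.asList(
                "-XX:+UseG1GC",
                "-XX:+ParallelRefProcEnabled",
                "-XX:+UseStringDeduplication");

        System.out.println("Missing flags: " + missingFlags(actual, expected));
        // maxMemory() reflects the effective -Xmx limit, in bytes.
        System.out.println("Max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

The same two calls (getInputArguments() and maxMemory()) can also be run from the Jenkins script console, since Groovy can call these Java APIs directly.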

Along with the above JVM options, manually identified the running threads using the following script and noticed that some running threads were in an errored state.

import jenkins.model.Jenkins
import hudson.model.Queue

def jenkins = Jenkins.getInstance()

// Get the Jenkins queue
def queue = jenkins.getQueue()

// Get all items in the queue
def items = queue.getItems()

// Loop through each item in the queue
items.each { item ->
    println "Queue Item: ${item.task.name}, In Queue Since: ${item.inQueueSince}"

    // Get the human-readable parameter description for the queue item.
    // Queue.Item.getParams() returns a String, so print it directly.
    def params = item.getParams()

    if (params) {
        println "  Parameters: ${params.trim()}"
    }

    println ""
}

// Get all active threads
def threads = Thread.getAllStackTraces().keySet()

println "\nRunning Threads:"
threads.each { thread ->
    println "  Thread Name: ${thread.getName()}"
    println "  Thread ID: ${thread.getId()}"
    println "  Thread State: ${thread.getState()}"
    println ""
}

Then executed the interrupt() method on each of these threads to kill them manually. After restarting Jenkins, memory usage was stable and no 504 errors were observed.

NOTE: Used the htop Linux tool to monitor memory in real time.
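
As a rough illustration of how candidate threads might be picked out before interrupting them, the Java sketch below groups live threads by state. The class and method names are hypothetical and this is not the exact procedure used in this PR. Note also that Thread.interrupt() only sets the interrupt flag: it wakes threads that are sleeping, waiting, or blocked in interruptible I/O, but it does not release a thread that is BLOCKED on a monitor lock.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class ThreadStateReport {
    // Returns the names of all threads in the given collection that are in the given state.
    static List<String> namesInState(Collection<Thread> threads, Thread.State state) {
        List<String> names = new ArrayList<>();
        for (Thread t : threads) {
            if (t.getState() == state) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM.
        Collection<Thread> live = Thread.getAllStackTraces().keySet();

        // Print thread names grouped by state (NEW, RUNNABLE, BLOCKED, ...).
        for (Thread.State state : Thread.State.values()) {
            System.out.println(state + ": " + namesInState(live, state));
        }
    }
}
```

Reviewing the BLOCKED and WAITING groups first narrows down which threads are worth inspecting before any are interrupted.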

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@gaiksaya
Member

gaiksaya commented Oct 2, 2023

Should we take care of upgrading the main node instance size? https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L115

@prudhvigodithi
Collaborator Author

> Should we take care of upgrading the main node instance size? https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L115

We already have c5.9xlarge (68 GB memory and 36 CPUs); this should be sufficient with the limits I have added.

Member

@gaiksaya gaiksaya left a comment


LGTM! Let's try this out.

Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>
@prudhvigodithi prudhvigodithi merged commit a65e0d6 into opensearch-project:main Oct 2, 2023
3 checks passed