Improve jenkins performance #353

prudhvigodithi · 2023-10-02T16:47:33Z

Description

Improve Jenkins performance. This should address the Jenkins 504 timeout errors caused by OutOfMemoryError.

Issues Resolved

Part of #346

Analysis-1

Started debugging by creating a 360 Java heap dump with the open source yc tool.

Extracted yc-2023-09-29T04-32-09.zip and used the jhat CLI to debug the heap dump. Noted the error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.

Also analyzed all the dump files in the zip and found multiple threads in the BLOCKED state, each honoring the -Xss4m setting.

Apart from the threads in the BLOCKED state, noted multiple I/O errors on threads in the interrupted state.

Took the following measures to fix this:

Removed the -Xss setting to avoid an oversized thread stack.
Increased the Docker memory and application Xmx limits (applied manually to test whether this resolves the frequent OutOfMemoryError).
Deleted all agent nodes to clean up the I/O errors and the threads in the interrupted state.
Restarted the Docker (Jenkins) container.
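
To make the thread findings and the -Xss reasoning above concrete, here is a minimal Java sketch that counts the live threads of a JVM by state and estimates how much address space their stacks could reserve. It is not the tooling used in this PR: the class name ThreadStateSummary and both method names are hypothetical, and it reports on whichever JVM it runs in, so it would have to run inside the Jenkins JVM to reflect Jenkins threads. The 1 MiB figure is an assumption about the typical default stack size on 64-bit Linux.

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {
    // Count the live threads of this JVM by state (BLOCKED, WAITING, RUNNABLE, ...).
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    // Upper bound, in MiB, on the address space reserved for thread stacks.
    static long reservedStackMiB(int threadCount, int stackSizeMiB) {
        return (long) threadCount * stackSizeMiB;
    }

    public static void main(String[] args) {
        int total = 0;
        for (Map.Entry<Thread.State, Integer> e : countByState().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
            total += e.getValue();
        }
        // 4 MiB matches the removed -Xss4m setting; 1 MiB is the assumed default.
        System.out.println("Stacks at 4 MiB/thread: " + reservedStackMiB(total, 4) + " MiB");
        System.out.println("Stacks at 1 MiB/thread: " + reservedStackMiB(total, 1) + " MiB");
    }
}
```

With a few thousand threads, the difference between the two estimates reaches several GiB of reserved stack space outside the Java heap, which is the motivation for dropping -Xss4m.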

Analysis-2

Following the Jenkins GC performance tuning recommendations, added the following options to the Jenkins JVM.

- JENKINS_JAVA_OPTS=-Xms33g -Xmx65g -Dhudson.model.ParametersAction.keepUndefinedParameters=true -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1

-XX:+UseG1GC: Switches to the Garbage-First (G1) garbage collector. G1 is designed to be a low-latency collector. It divides the heap into regions and performs garbage collection on specific regions.
-XX:+ExplicitGCInvokesConcurrent: This flag makes explicit garbage collection calls (System.gc()) run concurrently with the application threads.
-XX:+ParallelRefProcEnabled: This flag enables parallel reference processing, which uses multiple cores more effectively.
-XX:+UseStringDeduplication: This option enables string deduplication in G1GC. Duplicate strings in the heap share the same underlying string data, which saves memory.
-XX:+UnlockExperimentalVMOptions: This flag allows the use of experimental JVM options. It is needed here for -XX:G1NewSizePercent=20, which sets the minimum young generation size to 20% of the heap.
-XX:+UnlockDiagnosticVMOptions: This flag unlocks diagnostic VM options, allowing the use of additional options for monitoring and troubleshooting the JVM. It is needed here for -XX:G1SummarizeRSetStatsPeriod.
-XX:G1SummarizeRSetStatsPeriod=1: This option sets how often, in number of GCs, G1GC summarizes its remembered set (RSet) statistics. A value of 1 produces a summary after every GC.
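
One way to confirm that the options above actually reached the JVM is to read them back through the standard java.lang.management API. The following is a minimal sketch, not part of this change: the class name JvmFlagCheck and the helper hasFlag are hypothetical, and the program reports on the JVM it runs in, so the same calls would need to run inside the Jenkins JVM to check Jenkins itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmFlagCheck {
    // True if the exact flag string (e.g. "-XX:+UseG1GC") is among the JVM arguments.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String[] expected = {
            "-XX:+UseG1GC",
            "-XX:+ExplicitGCInvokesConcurrent",
            "-XX:+ParallelRefProcEnabled",
            "-XX:+UseStringDeduplication"
        };
        for (String flag : expected) {
            System.out.println(flag + " -> " + (hasFlag(jvmArgs, flag) ? "set" : "not set"));
        }

        // Effective maximum heap, which should reflect -Xmx.
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MiB): " + maxHeapMiB);

        // Active collectors; with G1 enabled the names are expected to start with "G1".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Collector: " + gc.getName()
                    + ", collections so far: " + gc.getCollectionCount());
        }
    }
}
```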

Along with the above JVM options, manually identified the running threads using the following script and noticed that some running threads were in an errored state.

import jenkins.model.Jenkins
import hudson.model.Queue

def jenkins = Jenkins.getInstance()

// Get the Jenkins queue
def queue = jenkins.getQueue()

// Get all items in the queue
def items = queue.getItems()

// Loop through each item in the queue
items.each { item ->
    println "Queue Item: ${item.task.name}, In Queue Since: ${item.inQueueSince}"

    // Get the build parameters of the queue item (getParams() returns a String)
    def params = item.getParams()?.trim()

    if (params) {
        println "  Parameters: ${params}"
    }
    
    println ""
}

// Get all active threads
def threads = Thread.getAllStackTraces().keySet()

println "\nRunning Threads:"
threads.each { thread ->
    println "  Thread Name: ${thread.getName()}"
    println "  Thread ID: ${thread.getId()}"
    println "  Thread State: ${thread.getState()}"
    println ""
}

Then executed the interrupt() method on each of these threads to kill it manually. After restarting Jenkins, memory usage was stable and no 504 errors were observed.
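
For reference, the interrupt() step can be illustrated outside Jenkins with a small Java sketch. It is an illustration under assumptions, not the exact procedure used on the Jenkins host: the class name InterruptDemo and the thread name stuck-worker are hypothetical, and Thread.sleep stands in for an interruptible wait. Note that interrupt() is cooperative. A thread that is BLOCKED waiting to enter a synchronized block does not react to it until it acquires the lock.

```java
public class InterruptDemo {
    // Start a thread stuck in an interruptible wait, interrupt it,
    // and report whether it terminated within five seconds.
    static boolean interruptSleepingThread() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stands in for a stuck, interruptible wait
            } catch (InterruptedException e) {
                // Interrupt received: restore the flag and let the thread end.
                Thread.currentThread().interrupt();
            }
        }, "stuck-worker");
        worker.setDaemon(true);
        worker.start();

        worker.interrupt();
        try {
            worker.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("Worker terminated: " + interruptSleepingThread());
    }
}
```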

NOTE: Used the htop Linux tool to monitor memory usage in real time.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

gaiksaya · 2023-10-02T17:47:38Z

Should we take care of upgrading the main node instance size? https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L115

prudhvigodithi · 2023-10-02T18:03:14Z

Should we take care of upgrading the main node instance size? https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L115

We already have a c5.9xlarge (68 GB memory and 36 CPUs); this should be sufficient with the limits I have added.

gaiksaya

LGTM! Let's try this out.

resources/docker-compose.yml

Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>

prudhvigodithi requested review from peterzhuamazon, bbarani, gaiksaya, rishabh6788, zelinh, jordarlu and Divyaasm as code owners October 2, 2023 16:47

gaiksaya approved these changes Oct 2, 2023

peterzhuamazon reviewed Oct 2, 2023

Improve jenkins performance

cbcaee7

Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>

prudhvigodithi force-pushed the cloudwatch branch from 714bf8a to cbcaee7 Compare October 2, 2023 20:45

prudhvigodithi merged commit a65e0d6 into opensearch-project:main Oct 2, 2023
3 checks passed

prudhvigodithi mentioned this pull request Oct 13, 2023

Performance degradation in Jenkins opensearch-project/opensearch-build#4130

Closed
