Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak, C2 compiler #1190

Closed
1 task done
shalseth opened this issue Oct 29, 2024 · 12 comments
Closed
1 task done

Memory leak, C2 compiler #1190

shalseth opened this issue Oct 29, 2024 · 12 comments
Labels
bug Something isn't working jbs:reported Someone from our org has reported it to OpenJDK

Comments

@shalseth
Copy link

shalseth commented Oct 29, 2024

Please provide a brief summary of the bug

As of Temurin 21.0.3 we are experiencing excessive memory usage (and oom-killer) from our Java Tomcat containers.

This is off-heap memory usage, and we can easily trigger a 10GB mem usage (real mem usage, memstat:total_rss) with a Xmx=1G. From our current testing, the mem usage will go as high as whatever we configure as the docker container memory limit.

From a total_rss usage of 1GB, the mem usage within ~10 seconds goes up to the limit 10GB.

Application memory leaks are not a factor, we have checked heap with jmap, just to make sure.

We have no issues on Temurin 21.0.1 og 21.0.2, only starting with 21.0.3. Same issue also present on latest 21.0.5.

We have isolated the memory usage to the C2 Compiler thread, based on oom-killer syslog output, and with Native Memory Tracking.

On a lightly loaded production node we experience this after 1-2 hours, so we can quite easily reproduce the issue.

We are using Tomcat images from https://hub.docker.com/_/tomcat.

Ideally we should have some code or easy way to reproduce this outside of our application, but our application is quite large, and we don't know exactly what triggers the issue.

Realize it will be hard to track down based on this bug report, but I will post it anyway, in case others experience similar problems and / or there's some additional logs or debug output we could provide.

Did you test with the latest update version?

  • Yes

Please provide steps to reproduce where possible

  • Start Java Tomcat container
  • Run our closed source application (this might be a problem, I guess)

Expected Results

Memory usage similar to Temurin 21.0.1 - 21.0.2

Actual Results

Consumes all available memory up to docker container limit.

What Java Version are you using?

openjdk 21.0.5 2024-10-15 LTS OpenJDK Runtime Environment Temurin-21.0.5+11 (build 21.0.5+11-LTS) OpenJDK 64-Bit Server VM Temurin-21.0.5+11 (build 21.0.5+11-LTS, mixed mode, sharing)

What is your operating system and platform?

Ubuntu 20.04 docker host
Container image based on Ubuntu 22.04. Dockerhub: tomcat:9.0.90-jdk21-temurin-jammy

How did you install Java?

Docker image from https://hub.docker.com/_/tomcat, which uses eclipse-temurin:21-jdk-jammy

Did it work before?

Work on the following versions:

openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Temurin-21.0.1+12 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode, sharing)

openjdk 21.0.2 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode, sharing)

Did you test with other Java versions?

Also failing on:

openjdk 21.0.3 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)

Relevant log output

OOM killer (syslog)
Oct 27 02:20:14 <hostname> kernel: [12313888.554885] C2 CompilerThre invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Oct 27 02:20:14 <hostname> kernel: [12313888.554911]  oom_kill_process.cold+0xb/0x10
Oct 27 02:20:14 <hostname> kernel: [12313888.554975] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Oct 27 02:20:14 <hostname> kernel: [12313888.554988] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,mems_allowed=0-1,oom_memcg=/docker/8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,task_memcg=/docker/8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,task=java,pid=1235475,uid=1337
Oct 27 02:20:14 <hostname> kernel: [12313888.555474] Memory cgroup out of memory: Killed process 1235475 (java) total-vm:15036484kB, anon-rss:10207028kB, file-rss:39704kB, shmem-rss:0kB, UID:1337 pgtables:22964kB oom_score_adj:0

Java Native Memory Tracking output (as the problem is happening, Compiler memory usage is growing). Running with Xmx=1G
docker@<hostname>(devxx):~$ docker exec -i tomcat-$serviceGroup jcmd 1 VM.native_memory
Native Memory Tracking:

(Omitting categories weighting less than 1KB)

Total: reserved=8444154KB, committed=6461730KB
       malloc: 5747358KB #685549
       mmap:   reserved=2696796KB, committed=714372KB

-                 Java Heap (reserved=1024000KB, committed=464896KB)
                            (mmap: reserved=1024000KB, committed=464896KB) 
 
-                     Class (reserved=1052409KB, committed=17977KB)
                            (classes #20859)
                            (  instance classes #19720, array classes #1139)
                            (malloc=3833KB #78983) (peak=3849KB #79169) 
                            (mmap: reserved=1048576KB, committed=14144KB) 
                            (  Metadata:   )
                            (    reserved=131072KB, committed=102144KB)
                            (    used=100769KB)
                            (    waste=1375KB =1,35%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=14144KB)
                            (    used=12346KB)
                            (    waste=1798KB =12,71%)
 
-                    Thread (reserved=176700KB, committed=21028KB)
                            (thread #172)
                            (stack: reserved=176128KB, committed=20456KB)
                            (malloc=372KB #1049) (peak=494KB #1365) 
                            (arena=200KB #342) (peak=11107KB #356)
 
-                      Code (reserved=253719KB, committed=63699KB)
                            (malloc=6031KB #15848) (peak=6031KB #15809) 
                            (mmap: reserved=247688KB, committed=57668KB) 
 
-                        GC (reserved=61148KB, committed=50292KB)
                            (malloc=8244KB #26589) (peak=8282KB #26796) 
                            (mmap: reserved=52904KB, committed=42048KB) 
 
-                 GCCardSet (reserved=733KB, committed=733KB)
                            (malloc=733KB #2566) (peak=740KB #2566) 
 
-                  Compiler (reserved=5675077KB, committed=5675077KB)
                            (malloc=1348KB #1300) (peak=1682KB #1416) 
                            (arena=5673729KB #9) (at peak)
 
-                  Internal (reserved=1799KB, committed=1799KB)
                            (malloc=1763KB #51500) (peak=1863KB #53933) 
                            (mmap: reserved=36KB, committed=36KB) 
 
-                     Other (reserved=2742KB, committed=2742KB)
                            (malloc=2742KB #213) (peak=2764KB #255) 
 
-                    Symbol (reserved=30217KB, committed=30217KB)
                            (malloc=25127KB #295303) (peak=25128KB #295307) 
                            (arena=5090KB #1) (at peak)
 
-    Native Memory Tracking (reserved=10905KB, committed=10905KB)
                            (malloc=194KB #3433) (peak=203KB #3579) 
                            (tracking overhead=10712KB)
 
-        Shared class space (reserved=16384KB, committed=12972KB, readonly=0KB)
                            (mmap: reserved=16384KB, committed=12972KB) 
 
-               Arena Chunk (reserved=2780KB, committed=2780KB)
                            (malloc=2780KB #177676) (peak=5681830KB #177677) 
 
-                   Tracing (reserved=32KB, committed=32KB)
                            (arena=32KB #1) (at peak)
 
-                    Module (reserved=290KB, committed=290KB)
                            (malloc=290KB #4410) (peak=297KB #4412) 
 
-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB) 
 
-           Synchronization (reserved=2454KB, committed=2454KB)
                            (malloc=2454KB #24125) (peak=2459KB #24175) 
 
-            Serviceability (reserved=18KB, committed=18KB)
                            (malloc=18KB #37) (peak=66KB #369) 
 
-                 Metaspace (reserved=132724KB, committed=103796KB)
                            (malloc=1652KB #2421) (at peak) 
                            (mmap: reserved=131072KB, committed=102144KB) 
 
-      String Deduplication (reserved=1KB, committed=1KB)
                            (malloc=1KB #8) (at peak) 
 
-           Object Monitors (reserved=14KB, committed=14KB)
                            (malloc=14KB #71) (peak=129KB #637)

Manual logging total_rss per second (from cgroup v1 memstat)
Test container Xmx=1G, container limit 2,5GB. oom-killer disabled.

Tue Oct 29 14:00:00 CET 2024: Total RSS: 933.01 MB
Tue Oct 29 14:00:01 CET 2024: Total RSS: 1071.18 MB
------ Problem starts here ------
Tue Oct 29 14:00:02 CET 2024: Total RSS: 1457.25 MB
Tue Oct 29 14:00:03 CET 2024: Total RSS: 1862.92 MB
Tue Oct 29 14:00:04 CET 2024: Total RSS: 2231.92 MB
Tue Oct 29 14:00:05 CET 2024: Total RSS: 2451.58 MB

------ Reached container mem limit 2,5GB ------
JVM freezes, out of memory, and cannot create new threads:
java.io.IOException: Cannot allocate memory
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

Tue Oct 29 14:00:06 CET 2024: Total RSS: 2451.58 MB
Tue Oct 29 14:00:08 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:09 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:10 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:11 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:12 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:13 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:14 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:15 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:16 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:17 CET 2024: Total RSS: 2456.09 MB

@shalseth shalseth added the bug Something isn't working label Oct 29, 2024
@karianna
Copy link
Contributor

@shalseth What is your app doing roughly? Is it a pure Java app or does it have Scala, other JVM languages or any native components?

@shalseth
Copy link
Author

shalseth commented Oct 30, 2024

@karianna , thank for you reply. No other native components, just pure Java.

It's a large web application with many functions, running in Tomcat.

In this particular case however, it's purely MySQL database transactions using JDBC. Fetching data in a ResultSet, and serializing to JSON using gson.

What strikes me as weird is that we don't see anything like this on Temurin 21.0.1 / 21.0.2 after countless hours in production across several instances, and we can reproduce this in a few minutes on 21.0.3-21.0.5. Currently only from random clicking around in our application, but perhaps we can isolate it further at some point.

I did manage to get some more logs today when reproducing the issue. Used the following flags:

-Xlog:jit+compilation=debug
-XX:+UnlockDiagnosticVMOptions
-XX:LogFile=/var/log/tomcat9/hotspot_pid1.log
-XX:+LogCompilation
-XX:+PrintCompilation
-XX:+PrintCompilation2

Found a compilation task that takes 9995ms , and eats memory at the time I observe memory spikes, seen from outside of the docker container. In this case it consumed about +4GB of RSS memory. Xmx=1G.

-XX:+PrintCompilation / -XX:+PrintCompilation2

  52631 14809 % !   4       <package.name.redacted>.controller.ActivityController::getActivities @ 3075 (4542 bytes)
  62627 14809 % size: 187248(114816) time: **9995** inlined: 5335 bytes

hotspot_pid1.log (-XX:+LogCompilation)

<nmethod compile_id='14809' compile_kind='osr' compiler='c2' level='4' entry='0x00007f08ee6adca0' size='203192' address='0x00007f08ee6aa610' relocation_offset='344' insts_offset='13968' stub_offset='12878
4' scopes_data_offset='134016' scopes_pcs_offset='184104' dependencies_offset='193432' handler_table_offset='193832' nul_chk_table_offset='202736' oops_offset='132440' metadata_offset='132592' method='no.
<package.name.redacted>.controller.ActivityController getActivities (Lxx/xx/portal/core/model/CorporationGroup;Ljava/lang/String;Lxx/xx/portal/xxxxx/model/ActivityFilter;Lxx/xx/portal/core/model/database/OrderColumn;Lxx/xxx/portal/core/model/database/QueryLimit;Lxx/xxx/portal/core/database/Transaction;)Ljava/util/List;' bytes='4542' count='194' backedge_count='114869' iicount='194' stamp='62,606'/>
  62627 14809 % size: 187248(114816) time: 9995 inlined: 5335 bytes

See attached debug log from the compiler thread as well.

hs_c35_pid1.log

@karianna
Copy link
Contributor

Thanks for the extra details - reported upstream at https://bugs.openjdk.org/browse/JDK-8343322

@karianna karianna added jbs:reported Someone from our org has reported it to OpenJDK and removed Waiting on OP labels Oct 30, 2024
@shalseth
Copy link
Author

shalseth commented Nov 3, 2024

@karianna, following up with additional information based on upstream questions. I don't have an account at bugs.openjdk.org, so I would appreciate if you could forward this information.

Question:
Could the reporter try to narrow it down to a specific change by by building the VM himself?

Answer:
I've built OpenJDK from source and isolated the commit that introduces this behaviour:
See: https://bugs.openjdk.org/browse/JDK-8309203

I've built a version of OpenJDK up until the commit before the one above:
commit 8fbd10bc5db476abf76a55a456f811132041301f
Author: Sergey Bylokhov serb@openjdk.org 2024-01-03 22:00:07

This version does not display any of the mentioned symptoms.

I will gladly assist with reproducing the issue, and trying out any patches / changes.
We see the same behaviour on brief tests on current versions of JDK 22 and 23 as well.

Builds / commits:
No issues: 24b3b2b66f0bed3e9812999a5b6d511f31e9ad1f (2023-12-31)
No issues: 8fbd10bc5db476abf76a55a456f811132041301f (2024-01-03)
Issue observed: 966071a5f10a7a8910b2639e10c961924a07d675 (2024-01-04) <--- Issue first observed at this commit
Issue observed: b4c163d07b2db96ce91a0f712ced60ace14dbf63 (2024-01-17)

Question:
And is this really a memory leak? From the report it sounds like it's rather high memory consumption by a single compilation and the memory is freed right after.

Answer:
In this instance, the memory is not released for almost two minutes, so it at least looks like a temporary(?) leak ;)
It's worth to note that although the pattern we see is the same in test/production, we've seen much higher memory usages in the few instances we've run this in production.
Don't know how far it would have gone, but we've hit the container limit after consuming about 9G. Haven't seen that high number in the test environment yet.

Logs:
The abrupt rise in memory consumption is accompanied by a CPU spike, which stops after a few seconds. At the same time as the memory usage stops rising.

Log output from test run from a version of OpenJDK 21.0.3+9 that we've compiled today.

Sun Nov  3 15:46:06 CET 2024: Total RSS: 503.45 MB
Sun Nov  3 15:46:08 CET 2024: Total RSS: 523.17 MB
Sun Nov  3 15:46:09 CET 2024: Total RSS: 803.67 MB
Sun Nov  3 15:46:10 CET 2024: Total RSS: 1371.32 MB
Sun Nov  3 15:46:11 CET 2024: Total RSS: 1945.73 MB
Sun Nov  3 15:46:12 CET 2024: Total RSS: 2519.87 MB
Sun Nov  3 15:46:13 CET 2024: Total RSS: 3088.35 MB
Sun Nov  3 15:46:14 CET 2024: Total RSS: 3653.48 MB
Sun Nov  3 15:46:15 CET 2024: Total RSS: 4113.67 MB
Sun Nov  3 15:46:16 CET 2024: Total RSS: 4133.04 MB
Sun Nov  3 15:46:17 CET 2024: Total RSS: 4128.20 MB
Sun Nov  3 15:46:18 CET 2024: Total RSS: 4107.51 MB
Sun Nov  3 15:46:19 CET 2024: Total RSS: 4060.11 MB
Sun Nov  3 15:46:20 CET 2024: Total RSS: 4035.06 MB
Sun Nov  3 15:46:21 CET 2024: Total RSS: 4035.06 MB
Sun Nov  3 15:46:22 CET 2024: Total RSS: 2815.38 MB
Sun Nov  3 15:46:23 CET 2024: Total RSS: 2815.51 MB
Sun Nov  3 15:46:24 CET 2024: Total RSS: 2815.51 MB
Sun Nov  3 15:46:25 CET 2024: Total RSS: 2815.23 MB
Sun Nov  3 15:46:26 CET 2024: Total RSS: 2814.51 MB
Sun Nov  3 15:46:27 CET 2024: Total RSS: 2814.51 MB
Sun Nov  3 15:46:28 CET 2024: Total RSS: 2814.51 MB
Sun Nov  3 15:46:29 CET 2024: Total RSS: 2814.51 MB
Sun Nov  3 15:46:30 CET 2024: Total RSS: 2793.37 MB
Sun Nov  3 15:46:31 CET 2024: Total RSS: 2793.37 MB
Sun Nov  3 15:46:32 CET 2024: Total RSS: 2793.37 MB
Sun Nov  3 15:46:33 CET 2024: Total RSS: 2793.21 MB
Sun Nov  3 15:46:34 CET 2024: Total RSS: 2793.21 MB
Sun Nov  3 15:46:35 CET 2024: Total RSS: 2793.21 MB
Sun Nov  3 15:46:36 CET 2024: Total RSS: 2793.21 MB
Sun Nov  3 15:46:37 CET 2024: Total RSS: 2793.21 MB
Sun Nov  3 15:46:38 CET 2024: Total RSS: 2793.33 MB
Sun Nov  3 15:46:39 CET 2024: Total RSS: 2793.33 MB
Sun Nov  3 15:46:40 CET 2024: Total RSS: 2793.33 MB
Sun Nov  3 15:46:41 CET 2024: Total RSS: 2794.50 MB
Sun Nov  3 15:46:42 CET 2024: Total RSS: 2794.50 MB
....
Sun Nov  3 15:48:00 CET 2024: Total RSS: 520.06 MB

@TobiHartmann
Copy link

Thanks a lot for narrowing it down @shalseth. I'll put that information into JDK-8343322.

@TobiHartmann
Copy link

Is this still reproducible with JDK-8340824? @shalseth maybe you can take the patch and apply it to see if it helps:
openjdk/jdk@90c944f

@shalseth
Copy link
Author

shalseth commented Nov 4, 2024

@TobiHartmann, JDK-8340824 looks really promising so far!

I tried applying the patch to jdk-21.0.3-ga, which is quite old compared to the new patch, but it applied cleanly, and I see an immediate improvement.

Then did the same test with the patch on jdk-21.0.6+1 (commit: 7dc0f7a64224d37f639ab8e8da2c1aa3295cc92e).

In both cases I can notice a slightly larger memory footprint, with the patch applied, compared to 21.0.2, but it's negligible, and to the point where I wouldn't consider it a problem, or even know about it in the first place.

With the patch applied, the behaviour is much more in line with 21.0.1/21.0.2, and saves around 5-6GB in peak memory consumption.

Now during my test it goes from 500MB -> 2000MB -> 500MB during ~10 seconds, which is completely fine.
With 21.0.2, for comparison, would be around 500MB -> 1600MB -> 500MB, but again, this is not a big deal and won't cause any problems.

Note that this is an early report, so I will do some more testing, perhaps even with production workload.
Unless I have new findings that contradict my current findings with the patch, I would consider it fixed, assuming this patch finds its way into next 21.0.X release.

@TobiHartmann
Copy link

That's great news, thanks for verifying @shalseth. I'll keep JDK-8343322 open for now until final feedback from you.

assuming this patch finds its way into next 21.0.X release

That's up to the OpenJDK maintainers to decide, I'm representing the Oracle JDK here and we backported the fix to Oracle JDK 21.0.7 already.

@rwestrel
Copy link

rwestrel commented Nov 4, 2024

@shalseth I'll propose it for openjdk 21

@jerboaa
Copy link

jerboaa commented Nov 4, 2024

@shalseth I'll propose it for openjdk 21

Thanks, Roland!

@shalseth
Copy link
Author

shalseth commented Nov 5, 2024

Been running jdk-21.0.6+1 (commit: 7dc0f7a64224d37f639ab8e8da2c1aa3295cc92e) with the patch from JDK-8340824 in production on a small subset of containers since yesterday.

We've had no issues so far. The original issue would present itself within max 1-2 hours, so I'm closing the issue.

I'm planning to stick with Temurin 21.0.2 in production for now, and upgrade when a Temurin 21.0.X build based on OpenJDK with the patch applied becomes available. Will do the same testing then.

Thanks a lot for your help and assistance :)

@shalseth shalseth closed this as completed Nov 5, 2024
@TobiHartmann
Copy link

Great, thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working jbs:reported Someone from our org has reported it to OpenJDK
Projects
None yet
Development

No branches or pull requests

5 participants