Fix memory leak in pods #474

arichiardi · 2016-06-23T22:37:39Z

This was particularly evident when doing boot watch test and has been fixed thanks to Micha's
intuition of adding a clojure.core/shutdown-agents in boot.pod/destroy-pod.

Fixes adzerk-oss/boot-test#26

arichiardi · 2016-06-26T00:48:59Z

This still feels not right, please don't merge it.

I tracked down the call to the clojure agent pool to be here.

More to come 😄

stephenbrady · 2016-06-26T02:21:23Z

The call to shutdown-agents is actually already done in the .close method on the pod, which is a call to this.

Having spent a lot of time profiling for memory leaks with pods (see #268), my own experience (once those fixed were made) has been that either your own code or libraries are creating resources that need to be cleared. core.async, for example, creates a global executor service that must be freed otherwise you'll see a class leak at the very least since Clojure itself cannot be gc'd.

We use this as our custom destroy-pod function:

(defn destroy-pod
  [^ClojureRuntimeShim pod]
  (when pod
    (boot.pod/with-eval-in pod
      (when (find-ns 'clojure.core.async.impl.exec.threadpool)
        (let [e (resolve 'clojure.core.async.impl.exec.threadpool/the-executor)]
          (.shutdownNow ^java.util.concurrent.ExecutorService (var-get e))))

      (when (find-ns 'clojure.core.async.impl.timers)
        ; This will cause the thread to go into a terminated state (and print a ThreadDeath error)
        (let [t (resolve 'clojure.core.async.impl.timers/timeout-daemon)]
          (try (.interrupt ^Thread (var-get t))
            (catch Throwable t))))

      (when (find-ns 'datomic.api)
        (let [f (var-get (resolve 'datomic.api/shutdown))]
          (f false))))

    (.close pod)

    (.close ^java.net.URLClassLoader (.getClassLoader pod))))

arichiardi · 2016-06-26T02:23:56Z

Oh, I am using core.async in my app, let me try that!

arichiardi · 2016-06-26T02:30:29Z

Same applies to thread pools I guess...which is maybe why you have the datomic shutdown...

stephenbrady · 2016-06-26T02:33:26Z

Yep, any resource at all will do it. It's unfortunate that a decent number of libraries create resources for you by virtue of requiring them and don't properly expose those resources to be managed.

arichiardi · 2016-06-26T02:36:11Z

Yes unfortunate, I avoided starting the db and added your destroy fn but still the memory grows and grows...I am using the Hikary pool, which now I need to try to shutdown properly I guess, btw thanks for the hint.

stephenbrady · 2016-06-26T02:39:01Z

Sure thing. Good luck. I spent way too many hours chasing this; I feel the pain.

arichiardi · 2016-06-26T02:39:18Z

It is still very weird that the biggest memory consumption happen in clojure-agent-send-off-pool-1...

arichiardi · 2016-06-26T02:46:20Z

Ah well, not at all weird as we are running inside a future...https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/core/src/boot/core.clj#L945

stephenbrady · 2016-06-26T02:49:27Z

Yep, the symptom, not the cause.

arichiardi · 2016-06-26T02:57:36Z

So basically every defstate I have is not an atom so it contains a reference to whatever it is right? So every defstate should actually be an nillable atom...

stephenbrady · 2016-06-26T03:03:36Z

Seems like you're using mount. I'm only somewhat familiar with mount, but I thought it offered an api to terminate the resource held by the defstate var. Otherwise, you can just alter-var-root and set the var to nil (or explicitly close the resource at the var, if it has such a method)...or do the equivalent with an atom.

arichiardi · 2016-06-26T03:07:08Z

Yes it does provide that, on stop I will nil the atom(s) that I am going to use instead of having pure references... of course this is only a leak during recurring testing, potentially all apps with state leak in this way...

arichiardi · 2016-06-26T04:33:27Z

This also means that each boot task which wants to be used in a watch should provide a way to hook up a destroy function..

arichiardi · 2016-06-29T17:40:03Z

About this, maybe we should enable custom code in destroy-pod, kind of like cleanup but for a specific pod?

micha · 2016-06-29T18:06:50Z

Perhaps JMX can be used to find leaks automatically?

The pod-pool function already accepts a :destroy option that can be used to finalize things in the pod, and if you're calling destroy-pod manually you can evaluate any code you like in the pod before you do that, no?

micha · 2016-06-29T18:08:50Z

Ah, maybe something like a shutdown hook that can be installed in the pod to perform cleanup...destroy-pod would call those before finishing?

arichiardi · 2016-06-29T18:08:56Z

Yes you are right, there are already many hooks, but the problem arises when a task (boot-test) is executing in a pod and I don't have a way to execute custom custom at destroy time..

arichiardi · 2016-06-29T18:10:04Z

Yeah exactly

micha · 2016-06-29T18:11:26Z

makes sense to me

arichiardi · 2016-06-29T18:15:40Z

Maybe instead of an oblect here https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/base/src/main/java/boot/App.java#L326 we pass an (thread-safe) container of functions? I was checking where/when it is better to add this

micha · 2016-06-29T18:18:10Z

Or an atom in boot.pod that you can conj thunks onto. Then destroy-pod could do something like

(with-eval-in p (doseq [x @boot.pod/destroy-hooks] (x)))

arichiardi · 2016-06-29T18:19:46Z

Yes that would work too

arichiardi · 2016-06-29T20:30:28Z

Any objection on having a

(defn destroy-pod
  "Closes open resources held by the pod, making the pod eligible for GC."
  [pod]
  (when pod
    (*destroy-pod-hook*) ;; dynamic
    (.close pod)
    (.. pod getClassLoader close)))

instead?

The caller would need to do (simplified):

(boot
  (binding boot.pod/*destroy-pod-hook* my-cleaning-fn
    (boot-test/test)))

arichiardi · 2016-06-29T20:43:40Z

The latter also means that all the code inside the binding will "see" the destroy function, aka, all the call to pods created inside the binding will call that function.
This is not a surprising behavior, the contrary, giving a super fine grained control (i.e.: one custom destroy hook per pod), is not really a common use case imho.

micha · 2016-06-29T20:46:02Z

Possibly... although it seems like there is a more elegant solution somewhere, maybe?

Another option to consider is reaping defunct pods out-of-band. Another thread could periodically iterate over the pods in boot.pod/pods and if they've been destroyed it could perform actions on them to force them to die.

I would really prefer some way to use reflection or JMX debugging APIs or something like that to find threads that need to be shut down, automatically, without the need for hooks.

I would also like to find out which things can prevent a pod from being eligible for GC.

micha · 2016-06-29T20:50:18Z

Regarding the *hooks* vs. pod/hooks, it might be a "bothism"; we might want both. The *hooks* thing is good when you need to inject finalization procedures into code that you don't control (like your boot-test/test example). The pod/hooks approach, on the other hand, allows the code that creates the pod to also install the finalization code, which isn't really feasible with the dynamic vars way.

arichiardi · 2016-06-29T20:54:57Z

Good points in both your comments, I was already working on this, but then stopped.
In the particular problem I am facing, I cannot know the pod boot.test is creating and unless boot.test/test itself provides a hook in the params, I cannot execute any cleaning code.

I could create a task that does what you said above though, I am in a watch, and I could scan the pods, see the ones that have been destroyed (how to identify them?) and just remove their reference/step heavily on them so that they really die 😄

arichiardi · 2016-06-29T20:56:02Z

Also an atom probably would not solve my problem because again, how can I know the pod boot.test/test used?

micha · 2016-06-29T20:58:02Z

Yes, that's why maybe we need both. I mean if I make a task that creates a pod and my task is loading core.async in the pod, then my task should also ensure it gets cleaned up. How would that work with the dynamic var?

arichiardi · 2016-06-29T21:00:45Z

Can you do a binding inside a binding? Well theoretically maybe yes but I agree it is not really intuitive.

micha · 2016-06-29T21:07:22Z

You can use set! on the binding, but it just seems like a last resort. If we can solve this by reflection or debugger APIs and without any hooks that would be the ideal solution.

arichiardi · 2016-06-29T21:09:49Z

Yeah..there are ways of solving this with brute force without changing boot code, for instance I could use robert.hooke in order to wrap destroy-pod. If noone else is going to have this issue of course we can leave it there...

In order to fix memory leaks in task-local pods, we need a way to execute custom cleaning at pod destroy time. This was particularly evident when doing `boot watch test` in a resource heavy app. Fixes adzerk-oss/boot-test#26

micha · 2016-06-29T21:11:53Z

Or (alter-var-root #'boot.pod/destroy-pod #(comp % my-destroy-function))...

arichiardi · 2016-06-29T23:56:53Z

Uhm this is odd, I have to the say that now I am a bit lost http://pastebin.com/ZHVZ4rk8

arichiardi · 2016-06-30T00:01:31Z

Being a pod pool in boot.test I am not surprised we reuse pods/classloaders, but the above shows UrlClassLoader are not reused (one leaks). The Agent CL part is actually:

(run! #(util/warn "Agent CL %s\n" %)
                  (->> (Thread/getAllStackTraces)
                       (keys)
                       (map #(.getContextClassLoader %))
                       (into #{})))

micha · 2016-06-30T00:02:21Z

The pod pool creates new pods, it doesn't reuse them...

arichiardi · 2016-06-30T00:03:24Z

For completeness, the above is generated with:

(with-pass-thru _
            (run! #(util/warn "POD CL %s\n" %)
                  (->> boot.pod/pods
                       keys
                       (map #(.getClassLoader %)))
                       (into #{}))
            (run! #(util/warn "Agent CL %s\n" %)
                  (->> (Thread/getAllStackTraces)
                       (keys)
                       ;; (filter #(re-find #"clojure-agent" (.getName %)))
                       (map #(.getContextClassLoader %))
                       (into #{})))

arichiardi · 2016-06-30T00:48:43Z

Some simple set mathematics and I have the classloader that leaks:

([#object[java.net.URLClassLoader 0x5ce1d55 "java.net.URLClassLoader@5ce1d55"]
  [#object[org.apache.logging.log4j.core.util.Log4jThread 0x5fb48c2 "Thread[Log4j2-Log4j2Scheduled-1,5,main]"]]])

arichiardi · 2016-06-30T01:20:10Z

Oh man, really? https://logging.apache.org/log4j/2.x/manual/migration.html

Remove calls to org.apache.log4j.LogManager.shutdown(), they are not needed in version 2 because the Log4j Core now automatically adds a JVM shutdown hook on start up to perform any Core clean ups.

Starting in Log4j 2.1, you can specify a custom ShutdownCallbackRegistry to override the default JVM shutdown hook strategy.
Starting in Log4j 2.6, you can now use org.apache.logging.log4j.LogManager.shutdown() to initiate shutdown manually.

arichiardi · 2016-06-30T03:56:27Z

Oddly enough, alter-var-root does not seem to work from outside the pod probably because I need to alter the destroy-pod inside it?

arichiardi · 2016-06-30T19:02:23Z

@micha I went through a debugging session again and this is the result:

http://pastebin.com/NN0h9uA1

Maybe I asked you this question already, but shouldn't destroy-pod remove the reference in pod/pods when done?

I excluded the primary cause of my leak (log4j2), but by debugging I noticed this as well and I think it should be fixed, but maybe again I don't see other factors in play now.

As you can see in the bin as well, the pod-pool correctly makes and destroys (when :refresh is called) pods.

micha · 2016-06-30T19:05:12Z

The pod/pods is a WeakHashMap, shouldn't interfere with GC I don't think.

arichiardi · 2016-06-30T19:11:13Z

Ah right I remember now 😂 it is my printing keeping it there then? Because I see it growing and growing, even if I trigger the GC in VisualVM they are not GC-ed..someone is keeping them there

micha · 2016-06-30T20:03:15Z

Seems like that's the symptom, not the cause. Those keys will remain in the weak hash map until the pods are collected by GC, not the other way around.

arichiardi · 2016-06-30T21:47:28Z

You are right, sorry to bother with these weird questions, but I am more and more confused. If I trigger the GC manually then the keys in pods correctly disappear. If I never do it, I eventually end up having java.lang.OutOfMemoryError: Metaspace.
It is my first time dealing with this kind of issue and I really would like to understand the dynamics, even if I might seem super newbie (well, I am ah ah).
What is going on here? 😄

arichiardi · 2017-07-11T16:29:07Z

I can close this one, nobody has complained and I have not found a solution anyways 😄

martinklepsch · 2017-07-11T16:31:37Z

@arichiardi but don't we all agree that it still exists? If so I think we should keep it open just for reference.

arichiardi · 2017-07-11T16:33:23Z

To be honest I don't know if it still exists. I can leave it open yes, maybe it is better after all.

martinklepsch · 2017-07-11T16:48:24Z

Just a note to myself, potential steps to reproduce:

create lots of pods/pools
obtain references to all of them via boot.pod/pods
shutdown these pods
check with YourKit or JVisualVM for stale classes etc

arichiardi force-pushed the fix-pod-leak branch from ce8174c to 423f7ff Compare June 23, 2016 22:39

arichiardi force-pushed the fix-pod-leak branch from 423f7ff to d36ee42 Compare June 29, 2016 20:49

Add boot.pod/*destroy-pod-hook*

4b66f83

In order to fix memory leaks in task-local pods, we need a way to execute custom cleaning at pod destroy time. This was particularly evident when doing `boot watch test` in a resource heavy app. Fixes adzerk-oss/boot-test#26

arichiardi force-pushed the fix-pod-leak branch from d36ee42 to 4b66f83 Compare June 29, 2016 21:11

micha added the help wanted label Jul 30, 2016

arichiardi closed this Jul 11, 2017

arichiardi reopened this Jul 11, 2017

Fix memory leak in pods #474

Are you sure you want to change the base?

Fix memory leak in pods #474

Conversation

arichiardi commented Jun 23, 2016 • edited by micha Loading

arichiardi commented Jun 26, 2016

stephenbrady commented Jun 26, 2016

arichiardi commented Jun 26, 2016

arichiardi commented Jun 26, 2016

stephenbrady commented Jun 26, 2016

arichiardi commented Jun 26, 2016

stephenbrady commented Jun 26, 2016

arichiardi commented Jun 26, 2016

arichiardi commented Jun 26, 2016

stephenbrady commented Jun 26, 2016

arichiardi commented Jun 26, 2016

stephenbrady commented Jun 26, 2016

arichiardi commented Jun 26, 2016 • edited Loading

arichiardi commented Jun 26, 2016 • edited Loading

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016

arichiardi commented Jun 29, 2016 • edited Loading

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016 • edited Loading

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016 • edited Loading

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016 • edited Loading

arichiardi commented Jun 29, 2016 • edited Loading

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016

micha commented Jun 29, 2016

arichiardi commented Jun 29, 2016

arichiardi commented Jun 30, 2016 • edited Loading

micha commented Jun 30, 2016

arichiardi commented Jun 30, 2016 • edited Loading

arichiardi commented Jun 30, 2016

arichiardi commented Jun 30, 2016

arichiardi commented Jun 30, 2016

arichiardi commented Jun 30, 2016

micha commented Jun 30, 2016 • edited Loading

arichiardi commented Jun 30, 2016

micha commented Jun 30, 2016

arichiardi commented Jun 30, 2016 • edited Loading

arichiardi commented Jul 11, 2017

martinklepsch commented Jul 11, 2017

arichiardi commented Jul 11, 2017

martinklepsch commented Jul 11, 2017

arichiardi commented Jun 23, 2016 •

edited by micha

Loading

arichiardi commented Jun 26, 2016 •

edited

Loading

arichiardi commented Jun 26, 2016 •

edited

Loading

arichiardi commented Jun 29, 2016 •

edited

Loading

micha commented Jun 29, 2016 •

edited

Loading

arichiardi commented Jun 29, 2016 •

edited

Loading

micha commented Jun 29, 2016 •

edited

Loading

arichiardi commented Jun 29, 2016 •

edited

Loading

arichiardi commented Jun 30, 2016 •

edited

Loading

arichiardi commented Jun 30, 2016 •

edited

Loading

micha commented Jun 30, 2016 •

edited

Loading

arichiardi commented Jun 30, 2016 •

edited

Loading