
[BUG] Egeria Metadata Server crashes silently while processing a massive number of elements #235

Open
Labels: bug, containers

@juergenhemelt
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

While processing 500+ elements (KafkaTopics from the StrimziMonitorIntegrationConnector), the Egeria Metadata Server crashes silently. The last few lines of the log:

Tue Aug 16 12:02:43 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:43 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:46 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Relationship
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:46 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:48 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Classification
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:48 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: Element Classified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:49 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:50 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Element
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:50 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:53 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Relationship
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:53 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Classification
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: Element Classified
Tue Aug 16 12:02:53 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: Element Declassified
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-IT-INFRASTRUCTURE-0013 The IT Infrastructure Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-STEWARDSHIP-ACTION-0012 The Stewardship Action Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-DIGITAL-ARCHITECTURE-0012 The Digital Architecture Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-CONSUMER-0012 The Asset Consumer Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-OWNER-0012 The Asset Owner Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-DATA-MANAGER-0013 The Data Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:55 GMT 2022 mds1 Event OMAS-ASSET-MANAGER-0013 The Asset Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:57 GMT 2022 mds1 Event OMAS-GOVERNANCE-ENGINE-0019 The Governance Engine Open Metadata Access Service (OMAS) has sent event of type Metadata change event
Tue Aug 16 12:02:57 GMT 2022 mds1 Information OMAS-GOVERNANCE-ENGINE-0017 The Governance Engine Open Metadata Access Service (OMAS) sent a metadata change event to listening Open Watchdog Governance Action Services: New Element
Tue Aug 16 12:02:57 GMT 2022 mds1 Event OMAS-SECURITY-MANAGER-0013 The Security Manager Open Metadata Access Service (OMAS) has sent event of type: New Element
Tue Aug 16 12:02:58 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic mds1.openmetadata.repositoryservices.enterprise.mds1.OMRSTopic
Tue Aug 16 12:02:58 GMT 2022 mds1 Event OMRS-AUDIT-8009 The Open Metadata Repository Services (OMRS) has sent event of type Instance Event to the cohort topic egeria.omag.openmetadata.repositoryservices.cohort.mds.OMRSTopic.instances

Expected Behavior

Should have processed every KafkaTopic element.

Steps To Reproduce

  • Install metadata server and view server
  • Configure StrimziMonitorIntegrationConnector so that it connects to a Strimzi installation with 500+ topics
  • Start StrimziMonitorIntegrationConnector
  • Wait about 40 minutes

Environment

- Egeria:3.10
- OS: Kubernetes (Openshift 4.9)
- Java: 11.x

Any Further Information?

No response

@juergenhemelt added the bug and triage labels Aug 16, 2022
@planetf1
Member

planetf1 commented Aug 16, 2022

Do you have any further indications of what may have happened?

Was anything recorded in the statefulset or pod status (kubectl describe)?

Was the topic connector the only workload?

@planetf1
Member

Also, how was the behaviour after the crash? Did the system recover? Was any data lost? Did the connector resume retrieving topics?

@planetf1
Member

For example, I had an issue with my Jupyter container (a bug) which led to the container restarting and I'd see something like:

Containers:
  jupyter:
    Container ID:   cri-o://3e6a2f6d264f9e8c6f052f23e49cb2da6a1cf35c869104c0686da689ffd2604f
    Image:          quay.io/odpi/jupyter:lab-3.4.5
    Image ID:       quay.io/odpi/jupyter@sha256:34f7ddbdebf759e4121e25c1ebe06929a118346ecd8a3ea06dbd17ca75fb6c42
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 15 Aug 2022 17:47:49 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    128
      Started:      Mon, 15 Aug 2022 17:47:46 +0100
      Finished:     Mon, 15 Aug 2022 17:47:48 +0100
    Ready:          True
    Restart Count:  1

I'm not sure how useful it would be, but the exit code may help us confirm whether running out of heap (for example) could be the cause.

What about the pod itself: rather than kubectl logs, how about kubectl logs --previous? Is there any extra information there?
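For reference, a minimal set of checks along these lines (the pod name is illustrative) should surface the previous container's exit code and logs:

kubectl describe pod mds1-platform-0        # look at 'Last State', 'Reason' and 'Exit Code'
kubectl logs --previous mds1-platform-0     # output from the container run that terminated
kubectl get events --sort-by=.metadata.creationTimestamp   # any OOMKilled / eviction events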

@planetf1
Member

If none of that helps, we probably need to try one of the following:

  • examine the code to guess at the failure
  • analyse/debug: check allocations while the integration connector is working normally, via an agent (local or in k8s)
  • capture a heap dump on failure -- complicated by the fact we'd need to write some k8s logic to ensure we store/copy the dump to a persistent volume
  • enable debug logging (in case this helps; it may not)

@planetf1
Member

Just checking the log again - I presume this is the metadata server -- not where you are running the integration connector?

What repository connector are you using? I'm guessing it could be XTDB?

If we can't get any info from the suggestions above quickly, I wonder if it's worth testing also with local graph or in-memory - that wouldn't completely rule a core issue out (as this could relate to timing, concurrent requests etc) but may give some extra insight.

if it is xtdb, what's the xtdb configuration like?

@planetf1
Member

We have limited licenses for the 'YourKit' Java profiler (there are other tools available, including IntelliJ with async-profiler, for example), which can help in pinning down memory issues; it may be an option if the other approaches don't get us answers.

@juergenhemelt
Contributor Author

We use a local JanusGraph repository. On the first restart after the crash, the server is unable to connect to JanusGraph, which looks like a problem with JanusGraph. The oc describe output does not show any problems. The previous log file is the one I attached to this issue. I will try to reproduce the error and check the events of the StatefulSet and the Pod.

Would you recommend using XTDB instead of JanusGraph?

@planetf1
Member

@juergenhemelt I think generally we would suggest xtdb - it looks like a more scalable, reliable approach, and some performance analysis has been done on it.

That being said, there's no evidence one way or the other yet to suggest your issue is specific to the graph repository; we don't know. So whilst you could try it, equally it may be good to understand why it broke even with graph.

Were you using the default JanusGraph setup (uses berkeleydb)?

I think capturing the data would help; it would be interesting to see whether we can figure out if, for example, the JVM failed due to heap exhaustion. Or perhaps OpenShift killed it due to resource limits - perhaps you could also check the OpenShift dashboard to see what it's reporting?

I'd be interested in trying to reproduce this, though I may struggle to do so until later next week. I presume you just pointed your topic integrator at a Strimzi deployment hosting lots of topics? Do you have any assets to share that may speed up my attempt to reproduce (server config, chart config etc)?

@planetf1
Member

I'm also intrigued that you said the graph failed to work after the first restart.

I think you are using customised helm charts? It's possibly a problem there, or perhaps not. Do you have any logs from this failure to work on restart? For example, if the metadata collection id stored in the server config document gets out of sync with the actual repository (graph), then Egeria will throw an error and, at least in >=3.10, fail to start the server.

@planetf1
Member

This is an interesting kind of issue - how do we debug the JVM apparently going away? If the cause isn't external and nothing shows up in the k8s logs, I wonder if we need to do something like:

  • set jvm flag to capture heapdump
  • modify the helm charts so that not only is the flag set, but the output either goes to, or is copied to, a persistent volume
  • we can then retrospectively analyze the heap dump

^ I'll raise an issue on this

OR maybe we can somehow modify the startup script of the container so that it does not exit (as you'd normally want) on this error condition, so that we could then interactively connect and analyze

But it still might be easier to just connect to the JVM using an agent, which should give us similar information (ie using YourKit profiler or similar)

For your issue another alternative might be to try and run just the mds server locally
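For reference, the heap-dump capture mentioned above would likely come down to JVM flags along these lines; the /dumps mount point is an assumption and would need to be backed by a persistent volume in the chart:

export JAVA_TOOL_OPTIONS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps"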

@juergenhemelt
Contributor Author

juergenhemelt commented Aug 19, 2022

I have just switched to XTDB in-memory. The good news is that the performance is tremendously better, so I can test faster than before. The bad news is that I still have the same problem. I will continue analyzing.

@planetf1
Member

Good to hear the performance is better.
I'm also completing/fixing/merging #127, which allows our lab demo (notebook) chart to specify additional connectors. I'll look at the same in the base chart. This would help me/us set up a Strimzi connector environment, something I've intended to do for a while.

Please do share any data you capture especially from 'describe' against the k8s objects

@juergenhemelt
Contributor Author

Indeed, the oc describe of the pod gives a lastState:

    State:          Running
      Started:      Mon, 22 Aug 2022 10:49:56 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 22 Aug 2022 10:42:06 +0200
      Finished:     Mon, 22 Aug 2022 10:49:55 +0200

Obviously an out of memory issue.

@juergenhemelt
Contributor Author

The resource limits in Kubernetes were at 1GB of memory. I increased them and now it works. For the 547 topics it requires 3.4GB of memory. I think that's expected behaviour as I am still on in-memory XTDB.

@planetf1
Member

Did you have explicit 'resources' entries in your modified charts? Is that where you were setting the 1GB limit?
The standard charts don't specify a limit by default

When the pod restarted, did it continue processing ok? In a k8s environment we need to expect pods to fail - for example, nodes may fail or be rebooted and the pod rescheduled elsewhere. It's important that processing then continues normally - including resuming where it left off.

Are there any aspects of that which are problematic with the testing you have done?
What happens if you go far beyond 500 topics?
I am also wondering whether our JVM limits are set such that a failure could occur before aggressive garbage collection has kicked in. I can't see any explicit setting of heap in the container, though we do set MaxMetaspaceSize to 1G -- it's extremely unlikely we could get close to that, but it may be worth checking/removing. By default the heap is 1/2 of physical RAM, which should be the allocated size of the container.

I think this aspect is worth pursuing - we'd mostly expect the Java heap to be exceeded (for example if we had an effective memory leak), with an exception, before the allocated memory on the container ran out. We need to look at whether we're following best practice and/or how we could do better.
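One quick way to see what the JVM has actually chosen for heap and metaspace inside the running container is something like the following (pod name is illustrative; the grep runs locally, so nothing extra is needed in the image):

kubectl exec mds1-platform-0 -- java -XX:+PrintFlagsFinal -version | grep -iE 'maxheapsize|maxmetaspacesize|maxrampercentage'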

@planetf1 self-assigned this Aug 24, 2022
@juergenhemelt
Contributor Author

Yes, I have explicitly set resource limits; this is a policy in my organization. It was set to 1GB. I have now switched from XTDB in-memory to RocksDB persistence. That brings the used memory down again to less than 1GB, so it was the in-memory XTDB that took all the memory. It would be good to check that XTDB is not using more than the available memory and log an exception if it is, but I can imagine that this is not easy to implement.
For me this issue is fixed and you could close it.

@planetf1
Member

planetf1 commented Aug 24, 2022

If we get killed by k8s for exceeding limits, there's nothing we can do in the Java application -- only ensure recovery through the k8s deployment. If we hit a memory exception (ie out of heap) in the JVM there's not much we can do either (especially anything that allocates memory). An output message, maybe (and agreed).

I'll keep the issue open to look into this behaviour. There may be something about the container/image we use, or its settings, that allows the JVM to continue unhindered (ie not running aggressive GC) even where physical memory is restricted. Could we do better? I don't know, but I see it as an important question for production. There are also questions around heap monitoring best practice and which GC algorithm to use. We also need to check that operations continue normally after the pod is killed and restarted elsewhere, including, for example, how we handle commits of the Kafka offsets.

Likely a combination of understand, document, improve.

I'll modify the title

@planetf1 added the bug label and removed the triage label Sep 30, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.


@dwolfson
Member

dwolfson commented Mar 2, 2023

I am seeing similar (if not the same) behavior when running the Lab Helm charts with the notebook "working-with-standard-models". In particular, if I run with the in-memory repository, the pod silently fails. I am using Rancher Desktop with an allocation of 14GB and 4 cores. When I switch to the graph repository, it usually works but occasionally dies (it has happened once or twice). I suspect that the JVM may need more resources?

@planetf1
Member

planetf1 commented Mar 9, 2023

I'm fairly sure the orchestration environment is killing our JVM - hence we see no final runtime exception from the JVM.

I was wondering what memory the JVM is seeing.

In the Java 8 timeframe this was an issue. See https://docs.openshift.com/container-platform/3.11/dev_guide/application_memory_sizing.html for example, where it was essential to set flags like

-XX:+UnlockExperimentalVMOptions \
  -XX:+UseCGroupMemoryLimitForHeap \

Moving forward, we have better JVM options. There's now a flag UseContainerSupport, though this is enabled by default. We can also control heap sizing with InitialRAMPercentage, MinRAMPercentage, and MaxRAMPercentage; these all work off the memory available to the container.

But we also had a change in the kernel from cgroups v1 to cgroups v2, and it was only Java 15 that added support for cgroups v2. I expect OpenShift and other modern distros are using kernels with that.

So we're then back to the pre-container days of incorrect memory calculations.

From Java 15 onwards this seems corrected.

So, now that we are on v4 of Egeria and using Java 17, it may be that these issues have been addressed by default.
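As a hedged sketch, container-aware sizing with those flags might look like the following; the percentages are illustrative, not recommendations:

export JAVA_TOOL_OPTIONS="-XX:InitialRAMPercentage=25.0 -XX:MaxRAMPercentage=75.0"
# -XX:+UseContainerSupport is already on by default (Java 10+); when a container
# memory limit is set, the percentages are calculated against that limit.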

@dwolfson were you running with 3.15 or 4? I wonder if both fail in the same way


@planetf1
Member

planetf1 commented Mar 9, 2023

Checking some real environments

Egeria 3.15, in openshift, 16GB node, default resource limits

sh-4.4$ java -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 4
    CPU Period: 100000us
    CPU Quota: -1
    CPU Shares: 2us
    List of Processors, 4 total:
    0 1 2 3
    List of Effective Processors, 4 total:
    0 1 2 3
    List of Memory Nodes, 2 total:
    0 1
    List of Available Memory Nodes, 2 total:
    0 1
    Memory Limit: 13.08G
    Memory Soft Limit: Unlimited
    Memory & Swap Limit: Unlimited
    Maximum Processes Limit: 44753

openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS, mixed mode, sharing)

This may mean the problem is simpler - we're not setting a container size limit at all -- so the JVM really can use as much as is available on a node, and the subtlety above isn't relevant. Each of our processes will see the same, so the potential is for a massive over-allocation.

So we need to ensure limits are applied - and configurable. This isn't an Egeria or even a container consideration; rather it's at the deployment level, ie in the charts (or operator). We must allow this to be configurable, just as other apps do; when a user is creating a deployment directly in YAML, they would add a resources section to the pod spec (see the example later in this thread and the sketch below).

This consideration is part of the sizing activity needed in any deployment: what exactly to set those values to.

Platforms like openshift may also have the ability to dynamically override resource specifications - for example see https://docs.openshift.com/container-platform/4.11/nodes/clusters/nodes-cluster-overcommit.html , though we note that the default is not to overcommit
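As a sketch of applying limits to an already-deployed platform without editing the chart (the statefulset name matches the lab chart listings later in this thread; the values are illustrative only):

kubectl set resources statefulset lab-odpi-egeria-lab-core \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=1000m,memory=1024Mi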

@planetf1
Member

planetf1 commented Mar 9, 2023

Running in Rancher desktop on macOS is interesting:

java -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 6
    CPU Period: 100000us
    CPU Quota: -1
    CPU Shares: 2us
    List of Processors, 6 total:
    0 1 2 3 4 5
    List of Effective Processors, 6 total:
    0 1 2 3 4 5
    List of Memory Nodes, 1 total:
    0
    List of Available Memory Nodes, 1 total:
    0
    Memory Limit: Unlimited
    Memory Soft Limit: Unlimited
    Memory & Swap Limit: Unlimited
    Maximum Processes Limit: Unlimited

openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS, mixed mode)

Here it is telling me there is NO limit (Unlimited), which is potentially even worse.

@planetf1
Member

planetf1 commented Mar 9, 2023

In summary:

  • currently pods will potentially use as much RAM as they can (especially in simpler k8s environments)
  • we may fail before even trying to do garbage collection
  • The k8s environment may offer controls to manage this (ie openshift) by modifying the pod definitions as they are being created
  • our helm charts need to provide a 'sensible' memory limit, at least for demos, to aim for more predictable behaviour (we do already for cts).
  • for the best reliability in the most environments, by default, we need to get the limits as low as we can
  • The limits need to be configurable
  • Applies to all the pods we deploy - ie including ui, nginx, jupyter etc
  • Docs should be updated with an explanation of our defaults, overriding & simple approach, ideally with links to standard k8s docs in this area
  • Note that in production environments this is a specialist skill

@planetf1 transferred this issue from odpi/egeria Mar 9, 2023
@dwolfson
Member

@planetf1 My testing has been with 3.15 so far - I'm happy to switch to testing with 4 if you think that would be helpful. I was running more tests with the Jupyter labs today and found several more ways to make Egeria crash - not just loading the huge archive in the working-with-standard-models lab, but also with the automated curation and other notebooks - it seems like garbage collection isn't happening (often enough?). It also seems like it often crashes if I try to run 2 or 3 notebooks after configuring Egeria.

@planetf1
Member

planetf1 commented Mar 14, 2023

Thanks for testing.

Can you try with v4? I think it would be most productive to focus there, as we have changes in the base image and JVM level.

That being said, I am almost certain the behaviour will be the same. I do suspect there's no garbage collection with the current sizing - running java -XshowSettings:system -version in the JVM will indicate what limits the JVM is working to. My guess is it's physical memory, as we have no limit.

(so maybe experimenting as below is more fruitful)

Our pods need to have properties similar to

resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1024Mi"
        cpu: "1000m"

currently the charts use

resources: {}

'requests' is the MINIMUM required to allow the pod to be scheduled (so adding up all the requests effectively defines the minimum RAM needed), whilst 'limits' is the most that can be used. Units above are mebibytes (Mi) for memory and millicores (m, 1/1000 of a CPU) for CPU. CPU is less critical; the minimum can probably be pretty low, and the max can be higher.

(Screenshot: the resources section of the chart template, charts/odpi-egeria-lab/templates/egeria-core.yaml)

You can try editing the charts and adding this in if you wish -- ie replace that section with the example above, and install the chart from disk (to do that, be in the directory which contains odpi-egeria-lab, and refer to the chart as just 'odpi-egeria-lab', not 'egeria/odpi-egeria-lab').

We'll need to update the charts to set these with sensible default values that are also configurable.

So trying to figure out the limits is the next thing - we want them to be as small as possible.
We'll try and get this into the charts soon. I suspect I've never hit the problem as I tend to run with 16GB physical nodes and don't run the more complex labs that were previously 'in development'.

Some platforms allow defaults to be set, perhaps using an admission controller (this modifies the kubernetes documents as they are ingested; it's a technique for validation & annotation, or for enforcing any kind of change/override).

There may also be a default setting in the k8s implementation itself - ie defined by the control plane. I don't know if Rancher Desktop has this -- it looks like it's set to unlimited. At least on OpenShift it's the node's physical memory.
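To see what the pods actually consume while a lab runs (this needs metrics-server in the cluster), something like:

kubectl top pods
kubectl top nodes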

@planetf1
Member

The image is from charts/odpi-egeria-lab/templates/egeria-core.yaml -- we have 4 distinct ones for each platform (some refactoring is needed really!)

@planetf1
Member

An alternative is to edit the definitions once deployed. Perhaps the easiest way is to edit the deployment ie

kubectl get deployments
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
lab-odpi-egeria-lab-jupyter        1/1     1            1           4d17h
lab-odpi-egeria-lab-nginx          1/1     1            1           4d17h
lab-odpi-egeria-lab-presentation   1/1     1            1           4d17h
lab-odpi-egeria-lab-ui             1/1     1            1           4d17h
lab-odpi-egeria-lab-uistatic       1/1     1            1           4d17h
lab-strimzi-entity-operator        1/1     1            1           4d17h

Then edit one of these ie with

kubectl get statefulset
NAME                           READY   AGE
lab-odpi-egeria-lab-core       1/1     4d17h
lab-odpi-egeria-lab-datalake   1/1     4d17h
lab-odpi-egeria-lab-dev        1/1     4d17h
lab-odpi-egeria-lab-factory    1/1     4d17h
kubectl edit statefulset lab-odpi-egeria-lab-core

Make the same edit as above to the resources section

Then save

Then delete the pod ie with kubectl delete pod xxxx. k8s will then create a new one, this time with the higher limits (as this is what is defined in the deployment)

There are other ways to edit also - you can read more at https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/use-resource-controller-to-dynamically-modify-the-upper-limit-of-resources-for-a-pod

@dwolfson
Member

The JVM says

[jboss@lab-odpi-egeria-lab-core-0 ~]$ java -XshowSettings:system -version
Operating System Metrics:
Provider: cgroupv1
Effective CPU Count: 4
CPU Period: 100000us
CPU Quota: -1
CPU Shares: 2us
List of Processors, 4 total:
0 1 2 3
List of Effective Processors, 4 total:
0 1 2 3
List of Memory Nodes, 1 total:
0
List of Available Memory Nodes, 1 total:
0
Memory Limit: Unlimited
Memory Soft Limit: Unlimited
Memory & Swap Limit: Unlimited
Maximum Processes Limit: Unlimited

openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.18.0.10-2.el8_7) (build 11.0.18+10-LTS, mixed mode)
[jboss@lab-odpi-egeria-lab-core-0 ~]$

I will try the above.

@planetf1
Member

Memory is unlimited. So why would Java do garbage collection? That is certainly a likely route to having the container killed.

The fix will be to ensure we have a resource declaration.

@planetf1
Member

I've added support for setting memory limits to the 'lab' chart (prerelease of 4.0)

Docs need to be added.

Quick test - the usual data catalog lab worked fine with the default values set, whilst for the standard models lab, both the 'core' and 'dev' platforms crashed.

There is still no exception logged at the end of the prior container run - ie it just stops.

@planetf1
Member

Tested again and was able to run the standard models lab when the Egeria pods were allocated 2GB, with usage peaking around 1400+ MB. Max heap is 80%, so we're in that region.

I've updated the defaults

It's in chart 4.0.0-prerelease.2

Perhaps you can find some better refinements & validate what values make sense -- at least as a first pass. I think we'll need to return to analyze in more depth

@dwolfson
Member

Thanks Nigel, did you test it with "Working with Standard Models" notebook? Which repository was configured? I ran some quick tests this evening with the prerelease chart and had failures with all three repositories even though the memory didn't seem to exceed 1400 MB. I will try to test a bit more carefully tomorrow.

@planetf1
Member

Tested the standard models lab using the in-memory repository.

On a 3x16GB OpenShift cluster the test ran successfully with no pod restarts.

For Rancher Desktop on mac, a 12GB cluster would not allow all pods to be scheduled due to insufficient memory.
A 14GB cluster worked, the lab ran successfully, and the memory used by the pods grew to 8.6 GB.

This demonstrates a challenge in setting values

  • If we set no limits failures are very likely to occur
  • if we set limits big enough to handle our labs, our memory requirement is a lot higher than before
  • if we distinguish between 'requests' (minimum) and 'limits'
    • being overcommitted means jvms may still fail
    • our container bases its memory defaults on the 'request' value, so would just end up with a small value instead - causing the original issue
    • we would have to explicitly set min/max - but are then back to potential overcommit

Some reduction is likely possible with closer monitoring using metrics

On rancher, after a test run the memory used was:
NAME CPU(cores) MEMORY(bytes)
lab-odpi-egeria-lab-core-0 1963m 1384Mi << these three are sized at 2GB
lab-odpi-egeria-lab-datalake-0 700m 1748Mi << this is close - and prob ok but needs checking
lab-odpi-egeria-lab-dev-0 161m 1464Mi
lab-odpi-egeria-lab-factory-0 2m 357Mi << we are still not doing this, should we remove? @mandy-chessell - allocated 1GB
lab-odpi-egeria-lab-jupyter-7d7fbcf578-w7289 1m 405Mi << SIzed at 2GB. Can prob be reduced to 1/2 or 2/3 GB?
lab-odpi-egeria-lab-nginx-564f9c47bc-v257l 1m 8Mi << sized at 256MB
lab-odpi-egeria-lab-presentation-799ddfc5f-r6plz 1m 29Mi <<768MB assigned - Managed 46MB after querying 100s of items
lab-odpi-egeria-lab-ui-599868cc9-jkx8j 2m 256Mi << 1GB
lab-odpi-egeria-lab-uistatic-5d8ccf677b-sr62s 1m 8Mi << 256MB
lab-strimzi-entity-operator-5969bd794-bwqr6 27m 775Mi
lab-strimzi-kafka-0 107m 859Mi << allocated 1GB
lab-strimzi-zookeeper-0 10m 220Mi << I allocated 768

So: 2 + 1.25 + 0.2 + 0.5 + 0.2 + 0.5, ie ~4.7GB of reduction? That might reduce the footprint to c. 10GB if we remove 'factory' and tighten the other components a little (before more involved investigations).

@planetf1
Member

With the graph repository, this notebook took 16 minutes to run (instead of <2 minutes; Rancher Desktop). No failure, but rather unusable. It did complete with no pod restarts and a similar amount of memory used (but we need JVM stats to analyse this properly).

@planetf1
Member

planetf1 commented Mar 20, 2023

Resource limits & jvm options can currently be set by creating a file ie ~/etc/rs.yaml containing

egeria:
  core:
     jvmopts: "-XX:+HeapDumpOnOutOfMemoryError"
     resources:
       limits:
         cpu: 2000m
         memory: 3000Mi
       requests:
         cpu: 200m
         memory: 3000Mi

This entry is for 'core'. For our other platforms add similar sections for 'datalake', 'dev', 'factory', and also for our ui components 'ui', 'uistatic', 'presentation'

Optionally, settings can be added for our other containers 'nginx', 'zookeeper', 'kafka' with the same resources section, ie:

nginx:
  resources:
    limits:
      cpu: 100m

etc
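The overrides file can then be supplied at install/upgrade time; the release and chart names here are the ones used elsewhere in this thread and may differ in your environment:

helm upgrade --install lab egeria/odpi-egeria-lab -f ~/etc/rs.yaml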

Once we understand the behaviour and have established sensible defaults, we can consider how to make this simpler for the user.

Another important part here is 'jvmopts', which applies to all of our containers.

One useful value is "-Dmanagement.endpoints.web.exposure.include=*", which enables the Spring actuator endpoints; this is probably the easiest way to capture metrics.

Here's an example just to enable metrics (but no changes to default resources)

egeria:
  core:
    jvmopts: "-Dmanagement.endpoints.web.exposure.include=*"
  datalake:    
    jvmopts: "-Dmanagement.endpoints.web.exposure.include=*"
  dev:
    jvmopts: "-Dmanagement.endpoints.web.exposure.include=*"
  factory:
    jvmopts: "-Dmanagement.endpoints.web.exposure.include=*"

If this proves useful, we can make this a default setting - perhaps even in the egeria server chassis itself, or else just the container or charts

See odpi/egeria-docs#718 where I've added some examples of using this URL, and through which we will provide further docs on this
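As a sketch of pulling JVM metrics once the actuator endpoints are exposed (assuming the platform's default https port 9443 and the standard Spring Boot actuator paths):

kubectl port-forward lab-odpi-egeria-lab-core-0 9443:9443 &
curl -k https://localhost:9443/actuator/metrics/jvm.memory.used
curl -k https://localhost:9443/actuator/metrics/jvm.gc.pause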

planetf1 added a commit to planetf1/egeria-charts that referenced this issue Mar 21, 2023
@planetf1
Member

See #245 which adds some example snippets to help define resources for the containers used in the lab chart

--

I've been looking at the memory profile when importing CIM using the YourKit(tm) analyser

The default when run locally is to use G1 - this shows fairly smooth behaviour with minimal (<1ms) GC delays. It does allow memory to get close to the heap size before kicking in, leading to some sawtooth behaviour, but it's consistent.

The default in our container is to use ParallelGC. This shows different behaviour - it cleans garbage much more quickly, and keeps the memory footprint smaller (as low as 250MB in a 1GB jvm), but we see large numbers of GC cycles. Additionally GCs block threads for an extended period, with frequent delays approaching 100s

This was replicated locally using

 JAVA_TOOL_OPTIONS="-Xmx1024m -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -Dmanagement.endpoints.web.exposure.include=*"

Some differences from the container: it doesn't set -Xmx, but instead sets a fraction of RAM. This makes sense in the resource-limited container where we set a 1GB or 2GB limit, but on a 32GB machine locally we need to be more precise.

Of these parameters, the option to exit the VM on OOM makes sense, as this is in a k8s environment where we would want the pod to restart rather than malfunction.

These settings are inherited from the Red Hat ubi9 OpenJDK container we use - they are defaults. However they seem inconsistent with our workload.

@planetf1
Member

Further observation: as suspected, with different platforms returning varying values for physical memory, for now I will set explicit memory limits for max heap in Java -- using the percentages is too variable.

ie

jvmopts: "-Xlog:os+container=trace -XX:+ExitOnOutOfMemoryError -XX:+UseG1GC -Xlog:gc -Xms410m -Xmx820m -Dmanagement.endpoints.web.exposure.include=*"

for a container with a request/limit of 1GB

However, even with this set, the datalake container is killed by the OS with an OOM, ie the last trace is:

[199.595s][trace][os,container] total physical memory: 16797204480
[199.595s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[199.596s][trace][os,container] Memory Limit is: 1048576000
[199.596s][trace][os,container] Path to /memory.usage_in_bytes is /sys/fs/cgroup/memory/memory.usage_in_bytes
[199.596s][trace][os,container] Memory Usage is: 937668608

Note

  • in this case the memory limit of the container is correct - 1GB, and the jvm is also aware of physical ram - 16GB
  • the last usage we see is a little under this. We know exceeding will get the process killed by k8s

So what is java doing?

At least in terms of the heap & gc:

[192.983s][info ][gc          ] GC(39) Concurrent Mark Cycle 322.952ms
[194.153s][info ][gc          ] GC(40) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 342M->116M(410M) 20.224ms
[194.199s][info ][gc          ] GC(41) Pause Young (Mixed) (G1 Evacuation Pause) 120M->113M(410M) 31.901ms
[195.762s][info ][gc          ] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 346M->119M(410M) 12.718ms
[197.606s][info ][gc          ] GC(43) Pause Young (Normal) (G1 Evacuation Pause) 345M->119M(410M) 14.162ms
[199.166s][info ][gc          ] GC(44) Pause Young (Concurrent Start) (G1 Evacuation Pause) 346M->120M(410M) 20.264ms
[199.166s][info ][gc          ] GC(45) Concurrent Mark Cycle
[199.354s][info ][gc          ] GC(45) Pause Remark 132M->132M(410M) 15.708ms
[199.439s][info ][gc          ] GC(45) Pause Cleanup 135M->135M(410M) 0.279ms
[199.445s][info ][gc          ] GC(45) Concurrent Mark Cycle 278.605ms

This shows the heap at the time of failure was only 410MB - well under the 1GB system limit.

This means we need to look at other memory consumption within the pod - both other Java (non-heap) data and any other processes. From local tests, the other Java usage should be small (which is why 80% for heap is a good starting point).
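A quick way to compare the cgroup's view of usage against the limit, from inside the pod (cgroup v1 paths as in the trace above; pod name illustrative):

kubectl exec lab-odpi-egeria-lab-datalake-0 -- \
  cat /sys/fs/cgroup/memory/memory.usage_in_bytes /sys/fs/cgroup/memory/memory.limit_in_bytes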

@planetf1
Member

planetf1 commented Mar 22, 2023

Restricting the metaspace to 100MB resulted in

Terminating due to java.lang.OutOfMemoryError: Metaspace

With 200MB, Egeria runs fine, but even with the resource limit set at ~1.2GB (heap limit about 800, metaspace 800), we still exceed the permitted memory usage and get terminated.

Even if we don't enable -XX:+ExitOnOutOfMemoryError (typical for k8s), we see no sign from the JVM stating it has exceeded any of its memory limits, such as heap.

This might suggest memory is being allocated from the native process heap.
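A sketch of how to confirm that, assuming the image ships the full JDK tooling: enable Native Memory Tracking via the jvmopts (it adds diagnostic overhead, so for temporary use only) and then query it inside the container (PID 1 assumed to be the JVM):

# add to jvmopts:  -XX:NativeMemoryTracking=summary
jcmd 1 VM.native_memory summary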

Kafka has a buffer.memory but this defaults to 32MB

@planetf1
Member

planetf1 commented Mar 22, 2023

As an example of Java non-heap usage: this is well into the test, locally, and we're around 300MB (and only around 400MB long-term on the heap), albeit with an 8GB JVM limit.

(Screenshot: memory profile, 2023-03-22 12:18)

@planetf1
Member

Retesting locally, and this same Java process was reporting 3390MB at an OS level:

PID    COMMAND      %CPU  TIME     #TH    #WQ  #PORTS MEM    PURG   CMPRS  PGRP  PPID  STATE    BOOSTS             %CPU_ME %CPU_OTHRS UID  FAULTS    COW   MSGSENT    MSGRECV
53338  java         131.5 27:09.10 293/2  1    20591+ 3291M+ 0B     1175M- 53076 1     running  *0[1]              0.00000 0.00000    501  697133+   634   311638+    155746+

(Screenshot: memory profile, 2023-03-22 12:23)

@planetf1
Member

(Screenshot: memory profile, 2023-03-22 12:34)

This was taken with heap limited to 1GB, but no other limits/jvm customization.

-Xmx1024m
-> heap using 313MB, though the limit is set to 910MB
-> non-heap getting toward 300MB
RSS - 1.35GB
-> about 120MB unaccounted for - Kafka buffers? And with latency this could grow higher in the container environment.

@planetf1
Member

Kafka producers will use up to buffer.memory (ie 32MB by default) to store messages client-side if they cannot be sent to the message broker quickly enough; once the buffer is full, the client API will block for up to max.block.ms (60s) before returning a failure.

This buffer applies to each Kafka producer.

The OMRS instances topic will have the highest workload, and for coco we have 3 cohorts (so ~100MB of RAM).

Worth trying (a config sketch follows this list):

  • Reduce buffer size -- though could result in other failures due to timeouts
  • Increase block timeout -- though could have other knock on effects
  • Increase memory allocation : worth experimenting to see if there is a fixed limit over and above java heap+non-heap that is needed to keep stable running
  • Improve kafka throughput - add more replicas
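Here is the config sketch for the first two items: buffer.memory and max.block.ms are standard Kafka producer properties, the values below are hypothetical, and how they get injected into Egeria's event bus configuration is not shown here and would need checking against the admin services docs:

cat > producer-overrides.properties <<'EOF'
# halve the client-side buffer
buffer.memory=16777216
# allow the producer to block for up to two minutes before failing the send
max.block.ms=120000
EOF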

@planetf1
Member

Tested again, with a 4GB resource limit and 1GB heap. Core & datalake still approach the real memory limit:

Core - maxes at 2.7GB
Datalake - maxes at 1.8GB
dev - maxes at 1.4GB
factory - maxes at 0.8GB

@planetf1
Member

I'm going to pursue this as follows.

We need to understand memory usage in more detail. There is a lot of potential usage in terms of file system buffers, even outside the heap & metaspace.

For now the prudent approach is to remove the memory limits & JVM configuration -- the option to add them remains, and snippets are provided in the config/values. This at least should mean most of the scenarios work most of the time on a decent configuration (the goal was to be more precise than this, but we don't have all the information to do so).

@dwolfson New charts are published. Would you like to try with the defaults?

@planetf1 assigned dwolfson and unassigned planetf1 Jun 13, 2023