
Consensus-Full-Nodes - resources usage issue #2935

Open
tty47 opened this issue Dec 14, 2023 · 8 comments
Labels: bug · priority:high · WS: Maintenance 🔧

Comments

@tty47 (Contributor) commented Dec 14, 2023

Summary of Bug

Hello team! 👋

I want to report an issue we are facing with the consensus-full-nodes: we cannot run them with less than 20GB of RAM. When the nodes have to sync the chain (for example on mocha), they fail to do so with less memory than that. Even if we cap them at 20GB, they keep consuming all the resources of the server until they hit OOM and crash.
This happens whether the nodes have to sync from scratch or only from a few days back.

In my view, the nodes should work with the resources they are given, even if that means syncing takes longer; crashing instead looks like a bug.

[Screenshot: resources-usage]

I assume that the nodes should work even with 8GB, as we have here
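For context, a minimal sketch of the kind of memory request/limit described above, assuming the nodes run as Kubernetes pods (the pod-style names suggest this); the image tag, pod name, and values are illustrative placeholders, not our actual chart values:

```yaml
# Hypothetical pod spec fragment; names and values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: consensus-full-0
spec:
  containers:
    - name: consensus
      image: ghcr.io/celestiaorg/celestia-app:v1.3.0  # placeholder tag matching the reported version
      resources:
        requests:
          memory: "8Gi"    # what the nodes are expected to work with
        limits:
          memory: "20Gi"   # the cap mentioned above; the container is OOM-killed when it exceeds this
```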

cc: @celestiaorg/devops

Version

v1.3.0

Steps to Reproduce

Start a consensus-full-node and connect it to an existing chain (mocha, for example); it will then have to sync to catch up. A minimal sketch of this setup is shown below.
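A minimal repro sketch, assuming a standard celestia-app installation; the moniker is a placeholder, and the mocha-4 genesis and peers/seeds still need to be configured before starting:

```sh
# Hypothetical repro sketch; "my-consensus-node" is a placeholder moniker.
celestia-appd init my-consensus-node --chain-id mocha-4
# Fetch the mocha-4 genesis and set seeds/persistent_peers in
# ~/.celestia-app/config/config.toml before starting.
celestia-appd start
```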

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
tty47 added the bug label on Dec 14, 2023
@rootulp (Collaborator) commented Dec 14, 2023

Thanks for the issue! A few questions:

  1. Based on the screenshot: do you have any idea why this is happening for consensus-full-snapshot-0 and not for consensus-full-2 or consensus-full-3? How does consensus-full-snapshot-0 differ from the other consensus nodes? Perhaps a config change?
  2. Does this repro on other chains besides mocha-4?

@evan-forbes (Member)

I think we ran into this before, and because of all of the IBC memo spam on mocha, the tx-index will eat gobs of memory. Was the tx-index configured to be on, @jrmanes?

@tty47 (Contributor, Author) commented Jan 5, 2024

hey guys!
sorry for the delay in my response, I just read your messages

  1. Based on the screenshot: do you have any idea why this is happening for consensus-full-snapshot-0 and not for consensus-full-2 or consensus-full-3? How does consensus-full-snapshot-0 differ from the other consensus nodes? Perhaps a config change?

This happens for every node. The image shows consensus-full-snapshot-0 because it was the one we had to restart at that specific moment, but it doesn't matter which node; there is nothing particular about this one.
We have seen it for the others as well.

  2. Does this repro on other chains besides mocha-4?

We could also see it in Arabica. The problem with reproducing it on other chains is that it mostly happens when the chain holds a lot of data, so we cannot easily reproduce it in robusta, for example.

I think we ran into this before, and because of all of the IBC memo spam on mocha, the tx-index will eat gobs of memory. Was the tx-index configured to be on, @jrmanes?

I would say yes, we have it defined here; please let us know if there is something we can tweak.

The main problem I see is that this kind of issue is hard to detect in Robusta: we would need a chain that already holds a lot of data and then connect a node to sync it before this scenario shows up.
Let us know if you see something we can do from our side to help.

@rootulp (Collaborator) commented Jan 5, 2024

Based on the lines you linked, it looks like the tx indexer is enabled and set to kv, which uses goleveldb. If the consensus nodes don't need to index transactions, we can remove those lines, because the default is null, which disables tx indexing. If we do need tx indexing, we can explore alternative db_backends and/or the psql option.
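For reference, a sketch of the relevant `[tx_index]` section of the node's `config.toml` (standard Tendermint/CometBFT options; exact keys and defaults may vary slightly by release, and the connection string is a placeholder):

```toml
[tx_index]
# "kv"   — index transactions in the node's local key-value store (goleveldb by default);
#          this appears to be what is enabled here.
# "null" — disable transaction indexing entirely.
# "psql" — offload indexing to an external PostgreSQL database.
indexer = "null"

# Only used when indexer = "psql"; the connection string below is a placeholder.
# psql-conn = "postgresql://user:password@host:5432/db?sslmode=disable"
```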

evan-forbes added the WS: Maintenance 🔧 and needs:triage labels on May 17, 2024
ninabarbakadze added the needs:grooming label on May 28, 2024
evan-forbes added the priority:high label and removed the needs:grooming label on Jul 8, 2024
@evan-forbes (Member)

If this is purely related to the KV indexer, can we perhaps close this issue and open a new one to improve the KV indexer?

@evan-forbes (Member)

We should be able to close this issue once we are able to run v2 in production, as that includes a new version of the KV indexer that should remedy the massive amount of memory used by the existing KV store.

@rootulp (Collaborator) commented Jul 24, 2024

Nina backported celestiaorg/celestia-core#1405 to celestia-core v1.38.0-tm-v0.34.29, which was released in celestia-app v1.13.0. Rachid bumped celestia-node to that release in this PR.

TLDR: we don't need to wait until celestia-app v2 is running in production. As soon as celestia-node cuts a release from main (likely v0.15.0), we can use the lightweight tx status work.

@tty47 (Contributor, Author) commented Jul 26, 2024

hello!
should I then close this issue?
