Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NBD server losing connection to PostgreSQL #154

Open
crabique opened this issue Apr 10, 2023 · 4 comments
Open

NBD server losing connection to PostgreSQL #154

crabique opened this issue Apr 10, 2023 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@crabique
Copy link

crabique commented Apr 10, 2023

Describe the bug

When using benji nbd with configuration pointing to a PostgreSQL pool, the NBD server stops initiating new client connections after some period of inactivity.

The following errors can be found in the process output:

   INFO: Incoming connection from <IP>:60948.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:60948]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:60948]: opt=1, length=28, data=b'<UID>'
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>

After that, all new connection requests cause the server to show something like that:

ERROR: [<IP>:44590] NBD_CMD_READ: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

Traceback (most recent call last):
  File "/benji/lib64/python3.6/site-packages/benji/nbdserver.py", line 346, in handler
    data = self.store.read(version, cow_version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1340, in read
    read_list = self._block_list(version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1311, in _block_list
    block = version.get_block_by_idx(block_idx)
  File "/benji/lib64/python3.6/site-packages/benji/database.py", line 586, in get_block_by_idx
    block = Session.query(Block).filter(Block.version_id == self.id, Block.idx == idx).one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2849, in one_or_none
    return self._iter().one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2910, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1582, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1798, in _execute_context
    conn = self._revalidate_connection()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 645, in _revalidate_connection
    self._invalid_transaction()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 625, in _invalid_transaction
    code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

This behaviour persists until the NBD server is restarted.

To Reproduce

Configure the following databaseEngine:

databaseEngine: postgresql://<PG_USER>:<PG_PASS>@<PG_HOST>/<BENJI_DB>

Try to connect to the NBD server using any client, then close the connection or keep it idle for 10-15 minutes (so that the server doesn't get any new requests), then try to connect again.

Expected behavior

NBD server works reliably without restarting it after every connection drop.

Platform and versions (please complete the following information):

  • benji v0.17.0
  • bitnami/postgresql-ha@10.0.9 helm chart for the database

Additional context

Based on the suggestions here, I managed to make it self-recover in such cases by hotpatching this line in site-packages:

self._engine = sqlalchemy.create_engine(url, pool_pre_ping=True)

Still, this sometimes requires the client to initiate the connection 2+ times for the server to recover and actually accept it, but this more or less fixes the problem.

Please let me know if I you need any more details or if I can help in any way.

@crabique crabique added the bug Something isn't working label Apr 10, 2023
@elemental-lf elemental-lf self-assigned this Apr 25, 2023
@elemental-lf
Copy link
Owner

Thanks for the report! What error message(s) or warnings are you getting with your patch installed? Adding pool_pre_ping=True shouldn't hurt either way but we should try to figure out what we need to do to make the re-connection process completely seamless from the user's POV.

@crabique
Copy link
Author

crabique commented Apr 25, 2023

Session where it fails and then recovers even with pre-ping enabled:

    INFO: [<IP>:32832] disconnecting
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>
    INFO: Incoming connection from <IP>:32834.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32834]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32834]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32834] Negotiated export: <UID>.
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=InternalError('Attempt to acquire lock Version <UID> twice.',)>
    INFO: Incoming connection from <IP>:32836.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32836]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32836]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32836] Negotiated export: <UID>.

@crabique
Copy link
Author

I saw some commits with relevant changes pushed to master, can this be considered resolved? And is there a release/tag I can try?

@crabique
Copy link
Author

crabique commented May 15, 2024

Turns out, the problem was my TCP proxy (ingress-nginx) killing the tcp-proxied PSQL connections after 600s (default stream-timeout value), it works perfectly when the ingress is taken out of the equation or when the timeout is increased.

However, when such timeouts happen, it seems the pool_pre_ping only fixes half the problem, the other half is that the operation fails (btw not only the NBD serve, but also long backup or restore, cleanup etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants