Issue with connection pooling #966
You can always pass in your custom headers (specify …). Other than that, I have no good idea how to debug/reproduce this either, as I would assume it is heavily dependent on the environment. Either way, we are using …
I'm experiencing issues with connection pooling on elasticsearch == 7.8.0. The issue seems to occur when the ES connection is idle for more than 4 seconds. The following script helps in showing the problem:

```python
#!/usr/bin/python3

def init_es(user, password, hosts):
    ...

def output_lsof(count, count_max, hosts):
    ...

async def main(args):
    ...

if __name__ == '__main__':
    ...
```

- when running with a sleep of 1 second, for 10 tries (no ping), there is no issue
- when running with a sleep of 5 seconds, for 10 tries (no ping), I'm seeing increasing connection counts
- when running the same test again, but with ping enabled, there is no issue

Hope this info helps in determining the issue.
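Since the function bodies above were not preserved, here is a hypothetical reconstruction of such a repro. The client setup, credentials, index, and the exact lsof invocation are all assumptions; only the loop/sleep/ping structure follows the description:

```python
#!/usr/bin/python3
# Hypothetical reconstruction of the repro described above.
import asyncio
import os
import subprocess

from elasticsearch import Elasticsearch  # assumes elasticsearch==7.x


def init_es(user, password, hosts):
    # Assumed setup: plain client with basic auth.
    return Elasticsearch(hosts, http_auth=(user, password))


def output_lsof(count, count_max, hosts):
    # Count this process's established TCP connections via lsof.
    out = subprocess.run(
        ["lsof", "-a", "-p", str(os.getpid()), "-i", "TCP"],
        capture_output=True, text=True,
    ).stdout
    established = [line for line in out.splitlines() if "ESTABLISHED" in line]
    print(f"try {count}/{count_max}: {len(established)} established connection(s)")


async def main(sleep_seconds=5, tries=10, do_ping=False):
    hosts = ["http://localhost:9200"]           # placeholder host
    es = init_es("elastic", "changeme", hosts)  # placeholder credentials
    for i in range(1, tries + 1):
        if do_ping:
            es.ping()
        es.search(index="_all", body={"query": {"match_all": {}}})
        output_lsof(i, tries, hosts)
        await asyncio.sleep(sleep_seconds)


if __name__ == "__main__":
    asyncio.run(main())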
Hi, I'm still having this issue when I try to connect to my MongoDB database using mongo-connector. In the middle of indexing I receive Errno 111 (connection refused); the rest of my database gets indexed afterwards, but many entries from that window are completely ignored. Is there a fix for these random timeouts that I could apply to elasticsearch-py version 5.0? That is the latest version compatible with mongo-connector. Thanks.
Hi, I've been encountering similar issues. TL;DR: frequently used connections work fine, but processes with long-lived connections generate retries after a timeout.

**Configuration**

Elasticsearch version: 7.12.0. Our topology is the following: haproxy load balancer -> 2 client nodes -> (3 master / 4 data nodes).

**Symptoms**

We initially decided to recreate a connection each time (following @honzakral's solution), but this has an impact on latencies and we cannot afford it anymore (latencies are higher when there is a high rate of connection creation). So we're trying to reuse connections with …

What we've experienced is that, after a long idle time (probably >30 min), the next request frequently generates a timeout, while connections with frequent requests are doing fine. Here is an example of such a request: Elasticsearch takes 26ms to serve the request, but the observed latency is >20 seconds. Note that in this example, …

**Hypothesis**

I've found this issue that seemed related at first sight: it explains that the default configuration on Linux servers is to send the first keepalive probe after 2 hours of idleness (which is much too high). The idea is that some connections are lost after some time (maybe 1 hour, for any reason: network, or maybe a misconfiguration of our haproxy) but the client does not notice, since the first keepalive probe is only sent after 2 hours (by default). So we modified our Elasticsearch configuration with the following parameters:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings

but we still encounter these issues, so this might be something else. I'm not familiar with how TCP keepalive works under the hood, and even after reading https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html, the Elasticsearch Python client documentation, and even urllib3, I'm still not sure I fully understand how disconnections should be handled by the client. Can the client distinguish a disconnection timeout from an "applicative" timeout (ES taking a long time to process the request)?

**@jabonte's snippet**

If this can help, I've reformatted @jabonte's code, but in my case I couldn't reproduce the issue (I always had a single connection).
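For reference, the 2-hour figure in the hypothesis above comes from the Linux kernel's default TCP keepalive settings (7200 seconds before the first probe, then 75 seconds between up to 9 unanswered probes). A quick sketch for checking them on a given host, using Linux-only /proc paths:

```python
# Print the kernel TCP keepalive defaults the hypothesis above relies on.
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().strip())
```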
I'm pinging @sethmlarson since he seems to be the perfect person for this, as maintainer of both the urllib3 and elasticsearch Python libraries 🙂
So the way urllib3 handles this without requiring a background process that checks for connection aliveness is: when a connection is checked out of the pool for a new request, urllib3 first checks whether the socket has been dropped (i.e. it has become readable while idle, which is what a server-side close looks like); if so, the connection is closed and a fresh one is established instead. This only catches closures that the client's OS has actually observed, so a connection silently dropped by a middlebox can still be handed out.
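A minimal sketch of that liveness check, using select directly rather than urllib3's internal helper (the function name here is illustrative):

```python
import select
import socket


def connection_looks_dropped(sock: socket.socket) -> bool:
    # An idle HTTP connection should have nothing to read. If select reports
    # the socket readable anyway, the server has either sent unsolicited data
    # or closed its end (a FIN makes the socket readable with EOF), so the
    # pooled connection should be discarded instead of reused.
    readable, _, _ = select.select([sock], [], [], 0.0)
    return bool(readable)
```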
Timeouts refer to TCP timeouts, not anything to do with the HTTP/application level (this can be confusing), so it's the timeout to receive any data on the socket before giving up. The default socket options for the Urllib3HttpConnection don't set any TCP keepalive socket options, but they could. Might be something to investigate to get better behavior on long-lived sockets.
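For illustration, opting into TCP keepalive at the urllib3 level means extending the default socket options, which only set TCP_NODELAY. A minimal sketch; the numeric values are arbitrary examples, not recommendations, and the TCP_KEEP* constants are Linux-specific:

```python
import socket

import urllib3
from urllib3.connection import HTTPConnection

# Start from urllib3's defaults (TCP_NODELAY) and opt into TCP keepalive.
keepalive_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # first probe after 60s idle
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),  # then one probe every 10s
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # give up after 3 failed probes
]

# socket_options is forwarded to every connection the pool creates.
pool = urllib3.HTTPConnectionPool("localhost", 9200, socket_options=keepalive_options)
```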
Thank you for this fast answer and the clarifications. For timeouts, I think I understand better now; still, shouldn't the query parameter … ? For …
I'm trying to figure out what the causes of failure would be in our case. I guess in most cases the socket is closed "gracefully" (e.g. the server signals that it is closing the connection); in such a case we don't try to perform the request on the old connection, and a new connection is opened instead, isn't it?
For socket options, reading the code, I think the following will do the trick (even if not "beautiful"):
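The original snippet is not shown here; a sketch of that approach, assuming it subclasses Urllib3HttpConnection to inject TCP keepalive socket options into the urllib3 pool, might look like the following (class name and values are illustrative):

```python
import socket

import urllib3
from elasticsearch import Elasticsearch
from elasticsearch.connection import Urllib3HttpConnection


class KeepaliveConnection(Urllib3HttpConnection):
    # Hypothetical subclass: after the pool is built, add TCP keepalive to the
    # socket options urllib3 applies to every new connection it opens.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pool.conn_kw["socket_options"] = (
            urllib3.connection.HTTPConnection.default_socket_options
            + [
                (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
                (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
                (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),
                (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
            ]
        )


es = Elasticsearch(["http://localhost:9200"], connection_class=KeepaliveConnection)
```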
I'll go with this and see if the problem occurs again. EDIT: this worked.
I'm closing this in favor of elastic/elastic-transport-python#36, as the fix would now go there instead of this repository.
We've been hitting issues with random requests timing out. It appears this is a problem with connection pooling, because we've been able to mitigate the issue with this fix:
readthedocs/readthedocs.org#5760
The symptoms we are seeing are timeouts on any kind of request (indexing, searches, deletes, etc.) that happen seemingly at random and across servers. All our testing showed nothing being wrong, so we applied the above fix in production, and it seems to have solved the problem.
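One mitigation discussed in the comments above is to create a fresh client per request, so no stale pooled connection is ever reused. A minimal sketch with placeholder host and index names, assuming a recent 7.x client where Transport.close() is available (note the latency cost of re-handshaking, as discussed above):

```python
from elasticsearch import Elasticsearch


def search_once(query):
    # Hypothetical per-request client: no connection ever sits idle in a pool,
    # at the cost of a new TCP (and possibly TLS) handshake on every call.
    es = Elasticsearch(["http://localhost:9200"])  # placeholder host
    try:
        return es.search(index="my-index", body=query)  # placeholder index
    finally:
        es.transport.close()  # release the pooled sockets immediately
```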
A couple of things: …