
Issue with connection pooling #966

Closed
ericholscher opened this issue Jun 3, 2019 · 8 comments

Comments

@ericholscher

We've been hitting issues with random requests timing out. It appears this is a problem with the connection pooling, because we've been able to mitigate this issue with this fix:

readthedocs/readthedocs.org#5760

The symptoms we are seeing are random timeouts on any request (indexing, searches, deletes, etc.) that occur across servers. All our testing showed nothing wrong, so we've applied the above fix to production, and it seems to have solved the problem.

A couple things:

  1. I don't know how to properly report or reproduce this bug, so let me know how I can be more helpful
  2. What is the best way to turn off connection pooling globally for now?
@honzakral
Contributor

You can always pass in your custom headers (specify headers={'connection': 'close'} as part of the client options) and specify maxsize=1 to avoid creating multiple connections.

Other than that I have no good idea how to debug/reproduce this either, as I would assume it is heavily dependent on the environment. We are using HTTPSConnectionPool from urllib3 pretty much directly, with no interference in the networking logic, so that would definitely be the place to start. Either way, I am transferring this issue to elasticsearch-py, as that is the library that holds the connection.
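As a minimal sketch of how those two suggestions fit together (the host URL is a placeholder, and the options are shown as a plain dict only to make the intent explicit before unpacking them into the client constructor):

```python
# Hedged sketch of the suggested client options, not a definitive setup.
# "connection: close" asks the server to close the socket after each
# response, and maxsize=1 caps the urllib3 pool at one connection per node.
client_options = {
    "headers": {"connection": "close"},
    "maxsize": 1,
}
# They would then be applied as, e.g.:
#   Elasticsearch(["https://localhost:9200"], **client_options)
print(sorted(client_options))
```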

@honzakral honzakral transferred this issue from elastic/elasticsearch-dsl-py Jun 3, 2019
@jabonte

jabonte commented Jul 27, 2020

I'm experiencing issues with connection pooling.
The connections do not seem to drop (this can be verified with 'lsof').
At some point there were several thousand TCP sessions reported to the ES cluster.

elasticsearch == 7.8.0
urllib3 == 1.24.2

The issue seems to occur when the ES connection is idle for more than 4 seconds.
For some reason, using ES (ping) seems to resolve the issue.

The following script helps in showing the problem.

#!/usr/bin/python3
import os
import ssl
import subprocess
import asyncio
import argparse
from elasticsearch import AsyncElasticsearch
from elasticsearch.connection import create_ssl_context

def init_es(user, password, hosts):
    '''Create the Elasticsearch connection'''
    context = create_ssl_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    client = AsyncElasticsearch(
        hosts,
        http_auth=(user, password),
        scheme="https",
        ssl_context=context,
        maxsize=1,
    )
    return client

def output_lsof(count, count_max, hosts):
    '''Count and print this process's open connections to the given hosts'''
    pid = os.getpid()
    lsof = subprocess.check_output(
        ['lsof', '-P', '-p', str(pid)],
        stderr=subprocess.STDOUT
    )
    count_lsof = 0
    for line in lsof.splitlines():
        for host in hosts:
            if host in str(line):
                count_lsof += 1
                print(line)
    print("PID %s (%s/%s) has %s connections" % (
        pid, count, count_max, count_lsof))

async def main(args):
    '''main loop'''
    user = '<user>'
    password = '<password>'
    hosts = ['<ES server 1>', '<ES server 2>', '<ES server 3>']
    es_client = init_es(user, password, hosts)
    count = args.count
    sleep = args.sleep
    ping = False
    if args.ping:
        ping = True
    print("Count = %s" % count)
    print("Sleep = %s" % sleep)
    print("Ping = %s" % ping)
    for cnt1 in range(count):
        indexes_open = (await es_client.indices.get_settings(index='*', expand_wildcards='open')).keys()
        output_lsof(cnt1, count, hosts)
        for cnt2 in range(sleep):
            if ping:
                await es_client.ping()
            await asyncio.sleep(1)
    await es_client.transport.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Test connection pooling")
    parser.add_argument('count', type=int, help="how many times the 'indices' are queried")
    parser.add_argument('sleep', type=int, help="sleep-time between queries")
    parser.add_argument('--ping', type=bool, help="force keep-alive 'ping'")
    args = parser.parse_args()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(args))

When running with a 'sleep' of 1 second for 10 tries (no ping), there is no issue:
test_connection_pooling_issue.py 10 1

When running with a 'sleep' of 5 seconds for 10 tries (no ping), I'm seeing increasing connection counts:
test_connection_pooling_issue.py 10 5

When running the same test again, but with 'ping' enabled, there is no issue:
test_connection_pooling_issue.py 10 5 --ping Y

Hope this info helps in determining the issue.

@najitaleb

Hi, I am still having this issue when I try to connect to my MongoDB database using mongo-connector.

In the middle of indexing, I receive an Errno 111 (connection refused) error; the rest of my database is then indexed, but many entries from that timeframe are completely ignored. Is there a fix for these random timeouts that I could apply to elasticsearch-py version 5.0? That is the latest version compatible with mongo-connector. Thanks.

@alk-lbinet

alk-lbinet commented Apr 13, 2021

Hi, I've been encountering similar issues.

TL;DR: frequently used connections work fine, but processes with long-lived idle connections see retries after timeouts.

Configuration

Elasticsearch version: 7.12.0
Elasticsearch python client version 7.12.0

Our topology is the following: haproxy load balancer -> 2 client nodes -> (3 master / 4 data nodes).

Symptoms

We initially decided to recreate a connection each time (following @honzakral's suggestion), but this has an impact on latencies and we cannot afford it anymore (latencies are higher when the connection creation rate is high).

So we're trying to reuse connections with keepalive.

What we've experienced is that the first request after a long idle time (probably >30 min) frequently times out, while connections with frequent requests are doing fine.

Here is an example of such a request: Elasticsearch takes 26 ms to serve the request, but the observed latency is >20 seconds.

Note that in this example, the Elasticsearch client was instantiated with retry_on_timeout=True and a timeout of 20 seconds, but the query was performed with a timeout of 1 second (I would have expected the query timeout (1s) to override the client timeout (20s); not sure if this is expected?).

Hypothesis

I've found this issue, which seemed related at first sight: it explains that the default configuration on Linux servers is to send the first keepalive probe after 2 hours (which is much too high).
elastic/elasticsearch#65213

The idea was that maybe some connections are lost after some time (maybe 1 hour) for any reason (network issues, or a misconfiguration of our haproxy), but this goes unnoticed by the client since the first keepalive probe is only sent after 2 hours (by default).
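On Linux, the 2-hour default for the first keepalive probe can be checked directly from /proc; a minimal sketch (the path is Linux-specific, so the value may be unavailable on other platforms):

```python
from pathlib import Path

# Linux exposes the idle time before the first TCP keepalive probe here;
# the stock kernel default is 7200 seconds (2 hours).
proc_path = Path("/proc/sys/net/ipv4/tcp_keepalive_time")
keepalive_time = int(proc_path.read_text()) if proc_path.exists() else None
print("tcp_keepalive_time:", keepalive_time)
```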

So we modified our Elasticsearch configuration with following parameters:

network.tcp.keep_idle: 60
network.tcp.keep_interval: 60
network.tcp.keep_count: 3

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings

but we still encounter these issues, so this might be something else.

I'm not familiar with how TCP keepalive works under the hood, and even after reading https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html, the Elasticsearch Python client documentation, and even urllib3's, I'm still not sure I fully understand how disconnections should be handled by the client. Can the client distinguish a disconnection timeout from an "applicative" timeout (ES taking a long time to treat the request)?

@jabonte snippet

If this can help, I've reformatted @jabonte's code, but in my case I couldn't reproduce the issue (I always had a single connection):

#!/usr/bin/python3

import os
import subprocess
import asyncio
import argparse
from elasticsearch import AsyncElasticsearch

USER = "xxx"
PASSWORD = "xxx"
HOSTS = ['xxx']

def output_lsof(count, count_max, hosts):
    pid = os.getpid()
    lsof = subprocess.check_output(
        ['lsof', '-P', '-p', str(pid)],
        stderr=subprocess.STDOUT
    )
    count_lsof = 0
    for line in lsof.splitlines():
        for host in hosts:
            if host in str(line):
                count_lsof += 1
                print(line)
    print("PID %s has %s connection(s) (call %s/%s)" % (pid, count_lsof, count + 1, count_max))


async def main(count, sleep, ping):
    es_client = AsyncElasticsearch(HOSTS, http_auth=(USER, PASSWORD), maxsize=1)
    for request_count in range(count):
        await es_client.indices.get_settings(index='validation_*', expand_wildcards='open')
        output_lsof(request_count, count, HOSTS)
        for _ in range(sleep):
            if ping:
                await es_client.ping()
            await asyncio.sleep(1)
    await es_client.transport.close()

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Test connection pooling")
    parser.add_argument('count', type=int, help="how many times the 'indices' are queried", default=10)
    parser.add_argument('sleep', type=int, help="sleep-time between queries", default=5)
    parser.add_argument('--ping', type=bool, help="force keep-alive 'ping'", default=False)
    args = parser.parse_args()
    count = args.count
    sleep = args.sleep
    ping = args.ping
    print("Count = %s" % count)
    print("Sleep = %s" % sleep)
    print("Ping = %s" % ping)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(count, sleep, ping))

I ping @sethmlarson since he seems to be the perfect person as maintainer of both urllib3 and elasticsearch python libraries 🙂

@sethmlarson
Contributor

So the way urllib3 handles this without requiring a background process that checks for connection aliveness is:

  • Keep a pool of sockets that were opened
  • When a new request comes in, pull a socket and try to send the request; if it fails to send data, it's because the socket is bad
  • Keep pulling sockets from the pool until a usable one comes out; if the pool is empty, create a new connection

Timeouts here refer to TCP timeouts, not anything at the HTTP/application level (this can be confusing): it's the timeout to receive any data on the socket before giving up.

The default socket options for Urllib3HttpConnection don't set any TCP keepalive socket options, but they could. That might be something to investigate to get better behavior on long-lived sockets.
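As a minimal stdlib sketch of what such TCP keepalive socket options look like (the timing values are arbitrary examples, and the TCP_KEEP* constants are platform-dependent, hence the hasattr guards):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable TCP keepalive on the socket; this is the kind of entry a
# urllib3 socket_options list could carry for long-lived connections.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# The probe timing knobs only exist on some platforms (e.g. Linux).
if hasattr(socket, "TCP_KEEPIDLE"):
    # idle seconds before the first probe is sent
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
if hasattr(socket, "TCP_KEEPINTVL"):
    # seconds between subsequent probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("keepalive enabled:", bool(keepalive_enabled))
sock.close()
```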

@alk-lbinet

alk-lbinet commented Apr 14, 2021

Thank you for this fast answer and clarifications.

For timeouts, I think I understand better now. Still, shouldn't the timeout query parameter provided at search time (cf. timeout in elasticsearch.client.Elasticsearch.search) override the timeout provided at client instantiation (elasticsearch.client.Elasticsearch.__init__)?
In my case it seems the 20s timeout (client init) was used, whereas I expected my query to time out after 1s (search timeout).

For Urllib3HttpConnection, is there currently a way to specify socket options (so that I can explicitly ask for keepalive), or would that be a feature request?

When a new request comes in, pull a socket and try to send a request, if it fails to send data it's because the socket is bad.

I'm trying to figure out what the causes of failure would be in our case. I guess in most cases the socket is closed "gracefully" (i.e. the server sends a signal to close the connection); in that case we don't try to perform the request on the old connection, and a new connection is opened, isn't it?

@alk-lbinet

alk-lbinet commented Apr 14, 2021

For socket options, reading the code, I think the following will do the trick (even if it's not "beautiful"):

import socket

from urllib3.connection import HTTPConnection
from elasticsearch import Elasticsearch
from elasticsearch.connection import Urllib3HttpConnection


class KeepAliveConnection(Urllib3HttpConnection):
    '''Connection class that enables TCP keepalive on the underlying pool'''

    def __init__(self, *args, **kwargs):
        super(KeepAliveConnection, self).__init__(*args, **kwargs)
        self.pool.conn_kw['socket_options'] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),
            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
        ]

client = Elasticsearch(
    < other *args and **kwargs>,
    connection_class=KeepAliveConnection,
)

I'll go with this and see if problem occurs again.

EDIT: this worked

@sethmlarson
Contributor

I'm closing this in favor of elastic/elastic-transport-python#36 as the fix would now go there instead of this repository.
