
Issue with connection pooling #966

Closed
ericholscher opened this issue Jun 3, 2019 · 8 comments

Comments

@ericholscher

We've been hitting issues with random requests timing out. It appears this is a problem with the connection pooling, because we've been able to mitigate this issue with this fix:

readthedocs/readthedocs.org#5760

The symptoms we are seeing are random timeouts on any request (indexing, searches, deletes, etc.) that occur across servers. All our testing showed nothing wrong, so we've applied the above fix to production, and it seems to have solved the problem.

A couple things:

  1. I don't know how to properly report or reproduce this bug, so let me know how I can be more helpful
  2. What is the best way to turn off connection pooling globally for now?
@honzakral
Contributor

You can always pass in your custom headers (specify headers={'connection': 'close'} as part of the client options) and specify maxsize=1 to avoid creating multiple connections.

Other than that I have no good idea how to debug/reproduce this either, as I would assume it is heavily dependent on the environment. We are using HTTPSConnectionPool from urllib3 pretty much directly, with no interference in the networking logic, so that would definitely be the place to start. Either way, I am transferring this issue to elasticsearch-py, as that is the library that holds the connection.
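As a minimal sketch of how those two suggestions fit together (the host URL is a placeholder, and the options are shown as a plain dict only to make the intent explicit before unpacking them into the client constructor):

```python
# Hedged sketch of the suggested client options, not a definitive setup.
# "connection: close" asks the server to close the socket after each
# response, and maxsize=1 caps the urllib3 pool at one connection per node.
client_options = {
    "headers": {"connection": "close"},
    "maxsize": 1,
}
# They would then be applied as, e.g.:
#   Elasticsearch(["https://localhost:9200"], **client_options)
print(sorted(client_options))
```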

@honzakral honzakral transferred this issue from elastic/elasticsearch-dsl-py Jun 3, 2019
@jabonte

jabonte commented Jul 27, 2020

I'm experiencing issues with connection pooling.
The connections do not seem to drop (this can be verified with 'lsof').
At some point there were several thousand TCP sessions reported to the ES cluster.

elasticsearch == 7.8.0
urllib3 == 1.24.2

The issue seems to occur when the ES connection is idle for more than 4 seconds.
For some reason, using ES (ping) seems to resolve the issue.

The following script helps in showing the problem.

#!/usr/bin/python3
import os
import ssl
import subprocess
import asyncio
import argparse
from elasticsearch import AsyncElasticsearch
from elasticsearch.connection import create_ssl_context

def init_es(user, password, hosts):
    '''Create the Elasticsearch connection'''
    context = create_ssl_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    client = AsyncElasticsearch(
        hosts,
        http_auth=(user, password),
        scheme="https",
        ssl_context=context,
        maxsize=1,
    )
    return client

def output_lsof(count, count_max, hosts):
    '''Count and print this process's open connections to the given hosts'''
    pid = os.getpid()
    lsof = subprocess.check_output(
        ['lsof', '-P', '-p', str(pid)],
        stderr=subprocess.STDOUT
    )
    count_lsof = 0
    for line in lsof.splitlines():
        for host in hosts:
            if host in str(line):
                count_lsof += 1
                print(line)
    print("PID %s (%s/%s) has %s connections" % (
        pid, count, count_max, count_lsof))

async def main(args):
    '''main loop'''
    user = '<user>'
    password = '<password>'
    hosts = ['<ES server 1>', '<ES server 2>', '<ES server 3>']
    es_client = init_es(user, password, hosts)
    count = args.count
    sleep = args.sleep
    ping = False
    if args.ping:
        ping = True
    print("Count = %s" % count)
    print("Sleep = %s" % sleep)
    print("Ping = %s" % ping)
    for cnt1 in range(count):
        indexes_open = (await es_client.indices.get_settings(index='*', expand_wildcards='open')).keys()
        output_lsof(cnt1, count, hosts)
        for cnt2 in range(sleep):
            if ping:
                await es_client.ping()
            await asyncio.sleep(1)
    await es_client.transport.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Test connection pooling")
    parser.add_argument('count', type=int, help="how many times the 'indices' are queried")
    parser.add_argument('sleep', type=int, help="sleep-time between queries")
    parser.add_argument('--ping', type=bool, help="force keep-alive 'ping'")
    args = parser.parse_args()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(args))

When running with a 'sleep' of 1 second for 10 tries (no ping), there is no issue:
test_connection_pooling_issue.py 10 1

When running with a 'sleep' of 5 seconds for 10 tries (no ping), I'm seeing increasing connection counts:
test_connection_pooling_issue.py 10 5

When running the same test again, but with 'ping' enabled, there is no issue:
test_connection_pooling_issue.py 10 5 --ping Y

Hope this info helps in determining the issue.

@najitaleb

Hi, I am still having this issue when I try to connect to my MongoDB database using mongo-connector.

In the middle of indexing, I receive an Errno 111 (connection refused) error; the rest of my database is then indexed, but many entries from that timeframe are completely ignored. Is there a fix for these random timeouts that I could apply to elasticsearch-py version 5.0? That is the latest version compatible with mongo-connector. Thanks.

@alk-lbinet

alk-lbinet commented Apr 13, 2021

Hi, I've been encountering similar issues.

TL;DR: frequently used connections work fine, but processes with long-lived idle connections see retries after timeouts.

Configuration

Elasticsearch version: 7.12.0
Elasticsearch python client version 7.12.0

Our topology is the following: haproxy load balancer -> 2 client nodes -> (3 master / 4 data nodes).

Symptoms

We initially decided to recreate a connection each time (following @honzakral's suggestion), but this has an impact on latencies and we cannot afford it anymore (latencies are higher when the connection creation rate is high).

So we're trying to reuse connections with keepalive.

What we've experienced is that the first request after a long idle time (probably >30 min) frequently times out, while connections with frequent requests are doing fine.

Here is an example of such a request: Elasticsearch takes 26 ms to serve the request, but the observed latency is >20 seconds.

Note that in this example, the Elasticsearch client was instantiated with retry_on_timeout=True and a timeout of 20 seconds, but the query was performed with a timeout of 1 second (I would have expected the query timeout (1s) to override the client timeout (20s); not sure if this is expected?).

Hypothesis

I've found this issue, which seemed related at first sight: it explains that the default configuration on Linux servers is to send the first keepalive probe after 2 hours (which is much too high).
elastic/elasticsearch#65213

The idea was that maybe some connections are lost after some time (maybe 1 hour) for any reason (network issues, or a misconfiguration of our haproxy), but this goes unnoticed by the client since the first keepalive probe is only sent after 2 hours (by default).
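On Linux, the 2-hour default for the first keepalive probe can be checked directly from /proc; a minimal sketch (the path is Linux-specific, so the value may be unavailable on other platforms):

```python
from pathlib import Path

# Linux exposes the idle time before the first TCP keepalive probe here;
# the stock kernel default is 7200 seconds (2 hours).
proc_path = Path("/proc/sys/net/ipv4/tcp_keepalive_time")
keepalive_time = int(proc_path.read_text()) if proc_path.exists() else None
print("tcp_keepalive_time:", keepalive_time)
```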

So we modified our Elasticsearch configuration with following parameters:

network.tcp.keep_idle: 60
network.tcp.keep_interval: 60
network.tcp.keep_count: 3

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings

but we still encounter these issues, so this might be something else.

I'm not familiar with how TCP keepalive works under the hood, and even after reading https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html, the Elasticsearch Python client documentation, and even urllib3's, I'm still not sure I fully understand how disconnections should be handled by the client. Can the client distinguish a disconnection timeout from an "applicative" timeout (ES taking a long time to treat the request)?

@jabonte snippet

If this can help, I've reformatted @jabonte's code, but in my case I couldn't reproduce the issue (I always had a single connection):

#!/usr/bin/python3

import os
import subprocess
import asyncio
import argparse
from elasticsearch import AsyncElasticsearch

USER = "xxx"
PASSWORD = "xxx"
HOSTS = ['xxx']

def output_lsof(count, count_max, hosts):
    pid = os.getpid()
    lsof = subprocess.check_output(
        ['lsof', '-P', '-p', str(pid)],
        stderr=subprocess.STDOUT
    )
    count_lsof = 0
    for line in lsof.splitlines():
        for host in hosts:
            if host in str(line):
                count_lsof += 1
                print(line)
    print("PID %s has %s connection(s) (call %s/%s)" % (pid, count_lsof, count + 1, count_max))


async def main(count, sleep, ping):
    es_client = AsyncElasticsearch(HOSTS, http_auth=(USER, PASSWORD), maxsize=1)
    for request_count in range(count):
        await es_client.indices.get_settings(index='validation_*', expand_wildcards='open')
        output_lsof(request_count, count, HOSTS)
        for _ in range(sleep):
            if ping:
                await es_client.ping()
            await asyncio.sleep(1)
    await es_client.transport.close()

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Test connection pooling")
    parser.add_argument('count', type=int, help="how many times the 'indices' are queried", default=10)
    parser.add_argument('sleep', type=int, help="sleep-time between queries", default=5)
    parser.add_argument('--ping', type=bool, help="force keep-alive 'ping'", default=False)
    args = parser.parse_args()
    count = args.count
    sleep = args.sleep
    ping = args.ping
    print("Count = %s" % count)
    print("Sleep = %s" % sleep)
    print("Ping = %s" % ping)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(count, sleep, ping))

I ping @sethmlarson since he seems to be the perfect person as maintainer of both urllib3 and elasticsearch python libraries 🙂

@sethmlarson
Contributor

So the way urllib3 handles this without requiring a background process that checks for connection aliveness is:

  • Keep a pool of sockets that were opened
  • When a new request comes in, pull a socket and try to send the request; if it fails to send data, it's because the socket is bad
  • Keep pulling sockets from the pool until a usable one comes out; if the pool is empty, create a new connection

Timeouts here refer to TCP timeouts, not anything at the HTTP/application level (this can be confusing): it's the timeout to receive any data on the socket before giving up.

The default socket options for Urllib3HttpConnection don't set any TCP keepalive socket options, but they could. That might be something to investigate to get better behavior on long-lived sockets.
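As a minimal stdlib sketch of what such TCP keepalive socket options look like (the timing values are arbitrary examples, and the TCP_KEEP* constants are platform-dependent, hence the hasattr guards):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable TCP keepalive on the socket; this is the kind of entry a
# urllib3 socket_options list could carry for long-lived connections.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# The probe timing knobs only exist on some platforms (e.g. Linux).
if hasattr(socket, "TCP_KEEPIDLE"):
    # idle seconds before the first probe is sent
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
if hasattr(socket, "TCP_KEEPINTVL"):
    # seconds between subsequent probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("keepalive enabled:", bool(keepalive_enabled))
sock.close()
```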

@alk-lbinet

alk-lbinet commented Apr 14, 2021

Thank you for this fast answer and clarifications.

For timeouts, I think I understand better now. Still, shouldn't the timeout query parameter provided at search time (cf. timeout in elasticsearch.client.Elasticsearch.search) override the timeout provided at client instantiation (elasticsearch.client.Elasticsearch.__init__)?
In my case it seems the 20s timeout (client init) was used, whereas I expected my query to time out after 1s (search timeout).

For Urllib3HttpConnection, is there currently a way to specify socket options (so that I can explicitly ask for keepalive), or would that be a feature request?

When a new request comes in, pull a socket and try to send a request, if it fails to send data it's because the socket is bad.

I'm trying to figure out what the causes of failure would be in our case. I guess in most cases the socket is closed "gracefully" (i.e. the server sends a signal to close the connection); in that case we don't try to perform the request on the old connection, and a new connection is opened, isn't it?

@alk-lbinet

alk-lbinet commented Apr 14, 2021

For socket options, reading the code, I think the following will do the trick (even if it's not "beautiful"):

import socket

from urllib3.connection import HTTPConnection
from elasticsearch import Elasticsearch
from elasticsearch.connection import Urllib3HttpConnection


class KeepAliveConnection(Urllib3HttpConnection):
    '''Connection class that enables TCP keepalive on the underlying pool'''

    def __init__(self, *args, **kwargs):
        super(KeepAliveConnection, self).__init__(*args, **kwargs)
        self.pool.conn_kw['socket_options'] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),
            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
        ]

client = Elasticsearch(
    < other *args and **kwargs>,
    connection_class=KeepAliveConnection,
)

I'll go with this and see if problem occurs again.

EDIT: this worked

@sethmlarson
Contributor

I'm closing this in favor of elastic/elastic-transport-python#36 as the fix would now go there instead of this repository.
