This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

rediscluster.cluster_failover() does not connect to the proper node #360

Open
psd314 opened this issue Mar 12, 2020 · 1 comment
Labels
3.0.0 (issues to be looked at for the 3.0.0 release) · Accepted (accepted bug or enhancement) · Accepting PR · type: bug

Comments


psd314 commented Mar 12, 2020

Description

I have a 9-node cluster with 3 masters, each master on its own machine. The cluster is cross-replicated, so the 2 slaves on each machine are replicas of the 2 non-local masters. When a machine goes down and comes back up, two masters end up on one machine; I'm working on rebalancing the cluster by manually failing over one of those masters to a slave.

import rediscluster

REDIS_AUTH_PASS='authPassword'

startup_nodes = [{'host': 'XXX.XX.43.124', 'port': 6379}]
rc = rediscluster.RedisCluster(
            startup_nodes=startup_nodes,
            decode_responses=True,
            password=REDIS_AUTH_PASS)
node_id = 'slave_node_id_to_failover'
rc.cluster_failover(node_id, 'FORCE')

Expected

That the targeted slave node be promoted to master and the associated master demoted to a slave. I can do this successfully on this cluster via redis-cli -h <host> -p <port> -a <password> CLUSTER FAILOVER FORCE.

Actual

Traceback (most recent call last):
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/bin/ark", line 11, in <module>
    load_entry_point('arkcli', 'console_scripts', 'ark')()
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
    super().invoke(ctx)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
    super().invoke(ctx)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/redis.py", line 121, in rebalance
    rc.cluster_failover(node['id'], 'FORCE')
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 539, in cluster_failover
    return self.execute_command('CLUSTER FAILOVER', Token(option))
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/utils.py", line 101, in inner
    return func(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 410, in execute_command
    return self.parse_response(r, command, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/client.py", line 768, in parse_response
    response = connection.read_response()
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/connection.py", line 638, in read_response
    raise response
redis.exceptions.ResponseError: You should send CLUSTER FAILOVER to a replica

Bug

The node_id argument in the cluster_failover method does not get passed on to execute_command and is never used to identify the proper node to connect to. From rediscluster/client.py:

    def cluster_failover(self, node_id, option):
        """
        Forces a slave to perform a manual failover of its master

        Sends to specefied node
        """
        assert option.upper() in ('FORCE', 'TAKEOVER')  # TODO: change this option handling
        return self.execute_command('CLUSTER FAILOVER', Token(option))
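One possible shape for a fix is to look the target node up by its id in the connection pool's node table and open the connection there, instead of letting execute_command hash the option string. This is only a sketch, not the library's actual fix: find_node_by_id is a hypothetical helper, and it assumes the node dicts exposed by rc.connection_pool.nodes.all_nodes() carry the 'id' field parsed from CLUSTER NODES.

```python
def find_node_by_id(nodes, node_id):
    """Hypothetical helper: return the node dict whose 'id' matches node_id.

    `nodes` is any iterable of node dicts shaped like the ones
    rc.connection_pool.nodes.all_nodes() yields.
    """
    for node in nodes:
        if node.get('id') == node_id:
            return node
    raise LookupError('node id %r not found in cluster' % node_id)


def cluster_failover(rc, node_id, option):
    """Sketch: send CLUSTER FAILOVER <option> to the replica named by node_id."""
    option = option.upper()
    if option not in ('FORCE', 'TAKEOVER'):
        raise ValueError("option must be 'FORCE' or 'TAKEOVER'")
    # Route by node id, not by hashing the option string.
    node = find_node_by_id(rc.connection_pool.nodes.all_nodes(), node_id)
    connection = rc.connection_pool.get_connection_by_node(node)
    try:
        connection.send_command('CLUSTER FAILOVER', option)
        return rc.parse_response(connection, 'CLUSTER FAILOVER')
    finally:
        rc.connection_pool.release(connection)
```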

It looks like the option argument gets used as a key to identify the node via slot lookup, and the master node for that slot gets returned. Also from rediscluster/client.py:

    @clusterdown_wrapper
    def execute_command(self, *args, **kwargs):
        """
        Send a command to a node in the cluster
        """
        if not args:
            raise RedisClusterException("Unable to determine command to use")

        command = args[0]

        # If set externally we must update it before calling any commands
        if self.refresh_table_asap:
            self.connection_pool.nodes.initialize()
            self.refresh_table_asap = False

        node = self.determine_node(*args, **kwargs)
        if node:
            return self._execute_command_on_nodes(node, *args, **kwargs)

        redirect_addr = None
        asking = False
        is_read_replica = False

        try_random_node = False

args[1], the option argument from cluster_failover, gets used to determine the slot in self._determine_slot(*args):

        slot = self._determine_slot(*args)
        ttl = int(self.RedisClusterRequestTTL)

        while ttl > 0:
            ttl -= 1

            if asking:
                node = self.connection_pool.nodes.nodes[redirect_addr]
                r = self.connection_pool.get_connection_by_node(node)
            elif try_random_node:
                r = self.connection_pool.get_random_connection()
                try_random_node = False
            else:
                if self.refresh_table_asap:
                    # MOVED
                    node = self.connection_pool.get_master_node_by_slot(slot)
                else:

The slot then gets used to look up the node to send the command to, which returns the master node that owns the slot rather than the node matching the node_id passed to cluster_failover:

                    node = self.connection_pool.get_node_by_slot(slot, self.read_from_replicas and (command in self.READ_COMMANDS))
                    is_read_replica = node['server_type'] == 'slave'
                r = self.connection_pool.get_connection_by_node(node)
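To see why routing by the option string is essentially arbitrary, here is a simplified sketch of the slot mapping Redis Cluster uses (CRC-16/XMODEM modulo 16384, ignoring hash tags). The Token 'FORCE' gets hashed like an ordinary key, so the command is dispatched to whichever master happens to own that slot, regardless of the node_id argument:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM (poly 0x1021, init 0), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def keyslot(key: str) -> int:
    """Simplified hash-slot mapping (ignores {hash tag} handling)."""
    return crc16(key.encode()) % 16384


# 'FORCE' maps to some fixed slot in 0..16383, unrelated to any node id.
print(keyslot('FORCE'))
```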

I was able to work around this and get the slave to initiate the failover by finding the target node in rc.connection_pool.nodes.all_nodes() and using it to establish a connection to that node and send the CLUSTER FAILOVER command directly:

>>> node = {'host': 'XXX.XX.43.124', 'port': 6379, 'name': 'XXX.XX.43.124:6379', 'server_type': 'slave'}
>>> connection = rc.connection_pool.get_connection_by_node(node)
>>> connection.send_command('CLUSTER FAILOVER', 'FORCE')
>>> resp = rc.parse_response(connection, 'CLUSTER FAILOVER')
>>> resp
True

Tasks

Allow cluster_failover to target a slave node by id to start a manual failover.

Grokzen (Owner) commented May 21, 2020

@psd314 I have been looking over this issue and dug into the code a fair bit to see what the problem is and what I can do to resolve it. What I decided for now is to put this on the back burner and take another look at it in the future. My current thinking about all of these cluster management and configuration commands is that I really don't want them inside the main RedisCluster class; instead, they should live in a separate management class that you would instantiate and call independently of the regular RedisCluster client. I have some ideas on how to make this work much better by providing better helper methods and tools to manipulate the cluster, and doing it in a separate class where loops and iterators are available would make administering a cluster much easier.

For anyone else finding this issue: right now my suggestion is that you use a plain Redis client instance, connect to your nodes directly, and run the administration of your cluster through that until I can rebuild this feature into something that works the way I want it to.
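That suggestion might look like the following sketch using plain redis-py: connect straight to the replica you want promoted and issue CLUSTER FAILOVER yourself, bypassing rediscluster's routing. Host, port, and password are placeholders, and failover_replica is just an illustrative wrapper, not part of either library.

```python
def failover_replica(host, port, password=None, option='FORCE'):
    """Ask one replica directly to fail over its master.

    Sketch only: connects with a plain redis-py client so the command is
    guaranteed to reach the node you name, not a slot-derived master.
    """
    option = option.upper()
    if option not in ('FORCE', 'TAKEOVER'):
        raise ValueError("option must be 'FORCE' or 'TAKEOVER'")
    import redis  # plain redis-py client, assumed installed
    r = redis.Redis(host=host, port=port, password=password)
    return r.execute_command('CLUSTER FAILOVER', option)
```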

@Grokzen Grokzen added Accepted Accepted bug or enhancement Accepting PR type: bug labels May 22, 2020
@Grokzen Grokzen added the 3.0.0 All issues that will be looked at for 3.0.0 release label Sep 17, 2020