Replies: 20 comments
-
Hi, On a two-node cluster, you could have a different recovery.conf.pcmk on each node, each file pointing to the other node. (I never tested this, but I think it would work.) As far as I understand the design behind PAF, I don't think that modifying PAF to edit the recovery.conf file will happen anytime soon; the idea is to rely on existing infrastructure to keep the agent simple. @ioguix is more involved in this project than me. Let's see what he thinks. Benoit
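To illustrate, here is a minimal sketch of such a two-node setup (the template path, hostnames and replication user are assumptions, adjust to your own environment):

```bash
# Sketch only: one recovery.conf.pcmk template per node, each pointing at the peer.
# On node1:
cat > /var/lib/pgsql/recovery.conf.pcmk <<'EOF'
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=node2 port=5432 user=replicator application_name=node1'
EOF

# On node2, the mirror image:
cat > /var/lib/pgsql/recovery.conf.pcmk <<'EOF'
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=node1 port=5432 user=replicator application_name=node2'
EOF
```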
-
Thinking a little more about that 2-node cluster thing... I am not sure what would happen when the resource is started (both instances would be started as standbys with valid replication targets, whereas with a normal setup the VIP is not up yet, so there is no valid target). I have to try this.
-
You probably can script something with local name resolution (/etc/hosts). How do you want to solve the client side? With DNS?
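For what it's worth, a rough sketch of that /etc/hosts idea (the alias name, the address and the hook mechanism are assumptions; PAF provides none of this out of the box):

```bash
# recovery.conf.pcmk would reference a stable alias instead of a real host:
#   primary_conninfo = 'host=pg-master port=5432 user=replicator application_name=node1'
# A hook run after each promotion would then repoint the alias on every node:
NEW_MASTER_IP=192.0.2.10          # hypothetical address of the promoted node
sed -i '/[[:space:]]pg-master$/d' /etc/hosts
echo "${NEW_MASTER_IP} pg-master" >> /etc/hosts
```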
-
Indeed. We have some more work to do on other subjects before adding more complexity to the code base.
Indeed, this would create a "loop" in the replication between the standbys. However, I'm not sure what would happen; this has to be tested, at least to know. But it feels wrong to me anyway, whatever the result of the test. Maybe we could draw something up here. Another way to escape would be to provide the master nodename of the resource as a cluster attribute after the promotion, so something can catch it and create appropriate iptables rules, or take any other kind of action (a rough sketch follows below). This could be discussed for 2.3 I guess.
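To make that last idea a bit more concrete, a rough sketch (the attribute name, the hook mechanism and the firewall action are all assumptions, purely for illustration):

```bash
# On the promoted node, e.g. from a post-promote hook or a Pacemaker alert,
# publish which node currently holds the master role:
crm_attribute --type crm_config --name master-nodename --update "$(crm_node -n)"

# On any node, something can then read the attribute back and act on it:
master="$(crm_attribute --type crm_config --name master-nodename --query --quiet)"
if [ "$(crm_node -n)" != "$master" ]; then
    # Purely illustrative action: refuse client connections on standbys so only
    # the master is reachable on 5432 (this would also block cascading standbys).
    iptables -I INPUT -p tcp --dport 5432 -j REJECT
fi
```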
-
The client runs its own haproxy, with a trivial HTTP server on each Postgres server to do healthchecking and determine the current master. The client is actually a Kubernetes cluster, so there is one redundant copy of haproxy for all applications in the cluster. This is the opposite of how we usually do it, with haproxy managed by Pacemaker, but the limitations of GCE's networking mean it works better. (With MySQL/MariaDB Galera clusters, we can eliminate the VIP entirely, which means no need to interact with GCE to make routing changes.) Unfortunately the two-node solution won't work for us since we have three nodes. I'm not sure about automatic iptables rules: if adding the rule failed for some reason, then we could end up with two Postgres servers replicating from each other, right? Maybe PAF just isn't suitable for this configuration. (In which case I will see if we can make the VIP work better, perhaps with a network overlay like Calico...)
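For illustration, a minimal sketch of what such a "trivial HTTP server" health check could look like (the port, the postgres user and the use of socat are assumptions, not a description of the actual setup):

```bash
#!/bin/bash
# pg_http_check.sh: answer HTTP 200 only when the local instance is a primary.
if pg_isready -q &&
   [ "$(psql -qtAU postgres -c 'SELECT pg_is_in_recovery()')" = "f" ]; then
    printf 'HTTP/1.0 200 OK\r\n\r\nmaster\r\n'
else
    printf 'HTTP/1.0 503 Service Unavailable\r\n\r\nstandby\r\n'
fi
```

Exposed with, for example, `socat TCP-LISTEN:8080,reuseaddr,fork EXEC:/usr/local/bin/pg_http_check.sh`, haproxy can then poll it with `option httpchk` to route clients to the current master.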
-
If you already have 3 nodes, go for Zalando's Patroni. It handles all the stuff for you and seems to be the perfect solution. https://github.com/zalando/patroni
-
+1 for Patroni. It seems to fit perfectly for container-based clusters (the devs couldn't tell me whether it could run outside of containers when I asked at pgconf.eu 2016). It will not cover all the failure scenarios Pacemaker can cover, but I would bet it covers 80-90% of the most frequent ones.
-
It has worked perfectly on bare metal from the very beginning. Actually it's the other way around: it was tricky to implement container support (running Patroni with pid=1), because it then also has to act as the init process (handle unexpected SIGCHLDs).
I wonder, what isn't covered yet? Patroni 1.3 supports the Linux watchdog.
-
Nice! Really, you guys were not on the same page during your talk; it was quite confusing to work out the answer. Moreover, I couldn't find anything about Patroni on bare metal. It is most often (always?) shown in a container-based architecture. Anyway, good to hear it works great on bare metal, I finally have my answer :)
I should admit I do not follow the Patroni dev cycle, so I'm not aware of new features. Congrats for supporting the watchdog and for your new release 1.3. My "feeling" about Patroni covering 80-90% of failure scenarios comes from various (old) sources. The "Patroni: Your HA Patron Saint" talk is one of them, where Josh says in the introduction that building an HA system fitting everyone's needs is quite hard, but we can at least build one covering 80% of the solution. Another point is the fact that Patroni relies on etcd to fail over, without taking care of the old master status and fencing it before electing another standby as master. I suppose we could add some fencing in the callbacks. But again, I might have misunderstood something or did not find the right doc here. Cheers,
-
Personally I prefer to show it in containers. A container is not magic; just think of it as a tiny physical machine. Although in the demo every container runs on a single laptop, they have their own "isolated" processes, network, storage and memory, i.e. processes from one container can communicate with processes in another container only via the network. The word isolated is quoted because in the end all resources are shared, but it is still good enough for modelling.
Ok, now I've got it. Requirements for HA can be really different and it's clearly not possible to meet all of them. Somebody needs failure detection and failover in less than one second, somebody else wants to have zero data loss without synchronous replication. Obviously Patroni can't cover that and won't even target it.
There could be different failure situations:
First let's talk about network partitioning: the master that ends up on the losing side of the partition fails to refresh its leader lock and demotes itself to read-only, while the nodes that still see a quorum elect a new leader.
Patroni death on the master node is more tricky and the most interesting one, because postgres will continue to run as a master.
What is not covered so far? The case when Patroni runs on bare metal, but without Watchdog.
Haproxy periodically sends health-check probes to the Patroni REST API (on all nodes) to figure out where the master runs. If Patroni has died on the master, haproxy will exclude this node from load balancing. Basically it means the master will stop receiving traffic. Haproxy will also terminate all existing connections. The only thing which is not really covered, and could be potentially dangerous, is Patroni + Postgres + VIP. Here I can only suggest running Patroni under a supervisor script which restarts it if it dies.
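For reference, a minimal sketch of that haproxy side (hostnames and ports are assumptions; Patroni's REST API usually listens on 8008 and answers 200 only on the node currently running the master):

```bash
# Sketch of an haproxy listener that always routes clients to the current master.
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
listen postgres-master
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008
EOF
```

The `on-marked-down shutdown-sessions` option is what terminates existing connections as soon as a node stops answering the health check.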
-
I'm not sure what to do with this issue. Considering the question of @unixwitch, it seems to me the key point here is: how should PAF support recovery.conf generation, e.g. to update the primary_conninfo (implying a standby restart on a Pacemaker decision)?
Does that sum up the question?
-
(Sorry in advance for hijacking this issue with a Patroni discussion.) I'm back on the subject as I spent some days playing with Patroni. I didn't dig too far, but at least I can answer your messages 7 months later :) Most of your points come back to the fencing question. In other words, fencing means the other nodes kill the old master to put a known and confirmed state on a node which does not reply anymore, should it be because of network, load or angry unicorns. Here is an illustration: https://ourobengr.com/stonith-story/ I think Patroni is in pretty good shape. I have to study it a little bit to understand if the fencing part could be integrated into the callback logic. The only concern I have is about failing fencing: should the fencing fail, the callback must be able to cancel the election. I know this is counter-intuitive, but it is needed to avoid split brain. It seems to me repmgr callbacks are able to cancel a promotion; at least some of my colleagues already rely on this in some failing-fencing situations. @CyberDem0n, if you have some input or guidance, it would save me some time to get quickly to the goal with my limited bandwidth. Thanks in advance.
-
Let's assume that we have 3 data centers in different locations, DC1, DC2 and DC3, and the master is running in DC1. If DC1 becomes isolated, what Patroni does in this case: it fails to update the leader lock in Etcd (because it is on the losing side of the partitioned network) and restarts postgres in read-only mode.
One should be really careful with callbacks. They are executed asynchronously, and if you are using them to move a virtual IP or something similar there is a race condition: patroni/patroni#536. Although in practice we have never experienced it.
It looks like I still don't really understand your definition of fencing. Is it about connecting to the old node and killing patroni/postgres? Is it about restarting the node with the help of IPMI or something similar? Or, more generally: is it about accessing the old node via some network protocol and doing some actions in order to make sure that postgres can't be accessed after that? And what should happen when the network is broken?
-
Wide clusters are different from local clusters, because of the network obviously. But let's consider it. Everything must be redundant, even the WAN network between DCs; if this network is not redundant, you have a SPoF. But anyway, if your network is 100% down, you probably cannot fence the isolated node (unless you have a GSM backup network or similar), so you must rely on something similar to a quorum. In the Pacemaker galaxy, this is called the "Cluster Ticket Registry" (CTR) and it looks like the Patroni algorithm (but is quite a bit older). CTR can speed up the shutdown process if needed using some fencing method.
Argh, this is quite a problem. That means callbacks are not able to cancel a promotion, right? On a local cluster with only two nodes, fencing is mandatory. Not being able to confirm that the fencing succeeded means you are not safe. If fencing failed, the promotion must be aborted. Using repmgr, we add fencing in the promotion script and abort the promotion if needed. This is critical, as I consider repmgr safe only for two-node clusters... as long as fencing is set up. As far as I understand it, Patroni sounds fine for 2-node clusters only thanks to the quorum policy on the etcd side (which requires at least 3 nodes). But quorum is not a silver bullet either... In short, watchdog != quorum != fencing. A proper HA setup should have all three of them.
Fencing could be: powering the node off through its IPMI/BMC card, cutting its power through a manageable PDU or UPS, or isolating it at the network level so postgres cannot be reached anymore.
Again, most of them mean the network must be redundant, with no SPoF. A cluster usually cannot survive two distinct failures; this is a big rule. However, note that PDUs and UPSes can often be managed from their serial port, bypassing the network altogether. As Pacemaker supports multi-level fencing, we can rely on two fencing methods if needed. Back to Patroni: I really think you should consider a solution so an admin can abort the promotion based on some external logic. That would help to add some fencing logic to the process. Do you think this is possible? Would you accept an external contribution for such a feature? Thanks,
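As a side note, here is a minimal sketch of what such two-level fencing looks like on the Pacemaker side (device names, addresses and fence-agent parameters are assumptions and vary between agent versions):

```bash
# Level 1: power the node off through its IPMI/BMC interface.
pcs stonith create fence_srv1_ipmi fence_ipmilan \
    pcmk_host_list="srv1" ipaddr="10.0.0.101" login="admin" passwd="secret"

# Level 2: fall back to a managed PDU if IPMI does not answer.
pcs stonith create fence_srv1_pdu fence_apc_snmp \
    pcmk_host_list="srv1" ipaddr="10.0.0.201" port="3"

# Register both as fencing levels for the node: level 2 is only tried
# when level 1 fails.
pcs stonith level add 1 srv1 fence_srv1_ipmi
pcs stonith level add 2 srv1 fence_srv1_pdu
```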
-
Correct. They are asynchronous, and executed more or less right after the promotion is triggered.
If you are using some kind of load balancer it is not really necessary. Of course things change if you use a VIP.
Patroni solely relies on an external quorum (etcd, zookeeper, consul), therefore running just two postgres nodes is fine.
All right, got it. Unfortunately in most cases those things are not really controlled by the DBA :(
How it currently works: when the leader key disappears, the healthy standbys race for it; the node that acquires the leader lock in the DCS then promotes its postgres.
The improvement sounds fairly simple: call some external script which does fencing before calling promote. There is one minor problem here: what if it takes too long to run such a script?
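To make that idea concrete, a purely hypothetical sketch of such a pre-promotion fencing hook (the script name, fence command, timeout and exit-code convention are all assumptions, not an existing Patroni interface):

```bash
#!/bin/bash
# fence_old_master.sh <old_master>: called before promotion; a non-zero exit
# code would mean "fencing failed, abort the promotion".
old_master="$1"

# Try to power the old master off, but give up after a bounded delay so a
# hanging fence device cannot stall the election forever.
if timeout 30 fence_ipmilan --ip="bmc-${old_master}" \
        --username=admin --password=secret --action=off; then
    exit 0     # fencing confirmed, promotion may proceed
fi
exit 1         # fencing failed or timed out, promotion must be cancelled
```

The bounded timeout is one possible answer to the "what if it takes too long" concern: after 30 seconds the hook gives up and the promotion is cancelled rather than stalled indefinitely.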
Sure.
-
Load balancers can fail, and they do not update their routing rules exactly at the moment of the role transition. This opens many windows where connections can fail or be routed to the wrong node.
Yes, that was what I was stating as well. This is a really smart architecture.
Indeed. This is not a deal breaker in every situation.
Depends on the production constraints: service availability or data availability :) In short, either block/do nothing, or pray and keep going with the promotion. We usually block.
I'll ask some colleagues who know Python if this is feasible on their side.
-
Hi guys,
Jan
-
Yes, this is quite useful, indeed. You rely on the client TCP setup though: TCP retries and timeouts might delay the moment when the client reacts to the switchover/failover.
I already tried to implement A as well, but gave up because the node list format depends on the underlying cluster architecture (CMAN vs. coro 1.x vs. coro 2.x). I'll have to check your code for B; it could be imported upstream I suppose. @YanChii, did you check Pacemaker's CTR setup for geo clusters?
-
If you make a lot of small connections, it can slow down your system (especially if there's a non-responding node). But I hope this will soon be solved by pgbouncer or similar projects.
I know. My change works only on CentOS systems, but I think it's only a matter of one uname + if statement; we already know the output format we have to deal with.
That would be cool. But you'll probably want it enabled via some config var (not enabled by default), because it creates an explicit CRM config entry (the only way I was able to make the info node-independent and persistent across reboots).
I did. It is a very complicated setup. I would go for Patroni instead :). Jan
-
Hi Jan, Good to see you again :)
I don't like the idea for our purpose, since we don't control where the client will connect anymore. There is always the risk that someone starts an instance in R/W outside of Pacemaker (human error, or because reasons). In that case, we rely on the order of the hosts in the connection string to connect to the right host (or need to remember to modify the connection strings). Of course, if it's not possible to have a VIP... it's better to have this than nothing.
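For context, the connection-string fallback discussed here looks like libpq's multi-host support; a minimal sketch, assuming PostgreSQL 10+ clients and hostnames srv1..srv3:

```bash
# libpq tries the hosts in order and keeps only a connection that accepts writes,
# i.e. the current primary (read-only standbys are skipped).
psql "host=srv1,srv2,srv3 port=5432 dbname=app target_session_attrs=read-write"
```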
-
Hello,
Is it possible to use PAF without a virtual IP address - for example, to have recovery.conf generated based on the current master's hostname or IP address?
Our use-case for this is running on Google Compute Engine (GCE), where there is no layer 2 networking between hosts; getting a functional virtual IP is quite complex (it requires using the GCE API to move the address between hosts) and also error-prone (sometimes the API seems to fail). Since we don't need a virtual IP address for clients, it would be nice if we could avoid it for PAF as well.