Replies: 20 comments
-
Hi, On a two-node cluster, you could have a different recovery.conf.pcmk on each node, each file pointing to the other node. (I never tested this, but I think it would work.) As far as I understand the design behind PAF, I don't think that modifying PAF to edit the recovery.conf file will happen anytime soon; the idea is to rely on existing infrastructure to keep the agent simple. @ioguix is more involved in this project than me. Let's see what he thinks. Benoit
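To illustrate, here is a minimal sketch of such a two-node setup (the template path, hostnames and replication user are assumptions, adjust to your own environment):

```bash
# Sketch only: one recovery.conf.pcmk template per node, each pointing at the peer.
# On node1:
cat > /var/lib/pgsql/recovery.conf.pcmk <<'EOF'
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=node2 port=5432 user=replicator application_name=node1'
EOF

# On node2, the mirror image:
cat > /var/lib/pgsql/recovery.conf.pcmk <<'EOF'
standby_mode = on
recovery_target_timeline = 'latest'
primary_conninfo = 'host=node1 port=5432 user=replicator application_name=node2'
EOF
```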
-
Thinking a little more about that 2-node cluster thing... I am not sure what would happen when the resource is started (both instances would be started as standbys with valid replication targets, whereas with a normal setup the VIP is not up yet, so there is no valid target). I have to try this.
-
You probably can script something with local name resolution (/etc/hosts). How do you want to solve the client side? With DNS?
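For what it's worth, a rough sketch of that /etc/hosts idea (the alias name, the address and the hook mechanism are assumptions; PAF provides none of this out of the box):

```bash
# recovery.conf.pcmk would reference a stable alias instead of a real host:
#   primary_conninfo = 'host=pg-master port=5432 user=replicator application_name=node1'
# A hook run after each promotion would then repoint the alias on every node:
NEW_MASTER_IP=192.0.2.10          # hypothetical address of the promoted node
sed -i '/[[:space:]]pg-master$/d' /etc/hosts
echo "${NEW_MASTER_IP} pg-master" >> /etc/hosts
```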
-
Indeed. We have some more work to do on other subjects before adding more complexity to the code base.
Indeed, this would create a "loop" in the replication between the standbys. However, I'm not sure what would happen; this has to be tested, at least to know. But it feels wrong to me anyway, whatever the result of the test. Maybe we could draw something up here. Another way to escape would be to provide the master nodename of the resource as a cluster attribute after the promotion, so something can catch it and create appropriate iptables rules, or take any other kind of action (a rough sketch follows below). This could be discussed for 2.3 I guess.
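To make that last idea a bit more concrete, a rough sketch (the attribute name, the hook mechanism and the firewall action are all assumptions, purely for illustration):

```bash
# On the promoted node, e.g. from a post-promote hook or a Pacemaker alert,
# publish which node currently holds the master role:
crm_attribute --type crm_config --name master-nodename --update "$(crm_node -n)"

# On any node, something can then read the attribute back and act on it:
master="$(crm_attribute --type crm_config --name master-nodename --query --quiet)"
if [ "$(crm_node -n)" != "$master" ]; then
    # Purely illustrative action: refuse client connections on standbys so only
    # the master is reachable on 5432 (this would also block cascading standbys).
    iptables -I INPUT -p tcp --dport 5432 -j REJECT
fi
```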
-
The client runs its own haproxy, with a trivial HTTP server on each Postgres server to do healthchecking and determine the current master. The client is actually a Kubernetes cluster, so there is one redundant copy of haproxy for all applications in the cluster. This is the opposite of how we usually do it, with haproxy managed by Pacemaker, but the limitations of GCE's networking mean it works better. (With MySQL/MariaDB Galera clusters, we can eliminate the VIP entirely, which means no need to interact with GCE to make routing changes.) Unfortunately the two-node solution won't work for us since we have three nodes. I'm not sure about automatic iptables rules: if adding the rule failed for some reason, then we could end up with two Postgres servers replicating from each other, right? Maybe PAF just isn't suitable for this configuration. (In which case I will see if we can make the VIP work better, perhaps with a network overlay like Calico...)
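For illustration, a minimal sketch of what such a "trivial HTTP server" health check could look like (the port, the postgres user and the use of socat are assumptions, not a description of the actual setup):

```bash
#!/bin/bash
# pg_http_check.sh: answer HTTP 200 only when the local instance is a primary.
if pg_isready -q &&
   [ "$(psql -qtAU postgres -c 'SELECT pg_is_in_recovery()')" = "f" ]; then
    printf 'HTTP/1.0 200 OK\r\n\r\nmaster\r\n'
else
    printf 'HTTP/1.0 503 Service Unavailable\r\n\r\nstandby\r\n'
fi
```

Exposed with, for example, `socat TCP-LISTEN:8080,reuseaddr,fork EXEC:/usr/local/bin/pg_http_check.sh`, haproxy can then poll it with `option httpchk` to route clients to the current master.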
-
If you already have 3 nodes, go for Zalando's Patroni. It handles all the stuff for you and seems to be the perfect solution. https://github.com/zalando/patroni
-
+1 for Patroni. It seems to fit perfectly for container-based clusters (the devs couldn't tell me whether it could run outside of containers when I asked at pgconf.eu 2016). It will not cover all the failure scenarios Pacemaker can cover, but I would bet it covers 80-90% of the most frequent ones.
-
It has worked perfectly on bare metal from the very beginning. Actually it's the other way around: it was tricky to implement container support (running Patroni with pid=1), because it then also has to act as the init process (handle unexpected SIGCHLDs).
I wonder, what isn't covered yet? Patroni 1.3 supports the Linux watchdog.
-
Nice! Really, you guys were not on the same page during your talk; it was quite confusing to work out the answer. Moreover, I couldn't find anything about Patroni on bare metal. It is most often (always?) shown in a container-based architecture. Anyway, good to hear it works great on bare metal, I finally have my answer :)
I should admit I do not follow the Patroni dev cycle, so I'm not aware of new features. Congrats for supporting the watchdog and for your new release 1.3. My "feeling" about Patroni covering 80-90% of failure scenarios comes from various (old) sources. The "Patroni: Your HA Patron Saint" talk is one of them, where Josh says in the introduction that building an HA system fitting everyone's needs is quite hard, but we can at least build one covering 80% of the solution. Another point is the fact that Patroni relies on etcd to fail over, without taking care of the old master status and fencing it before electing another standby as master. I suppose we could add some fencing in the callbacks. But again, I might have misunderstood something or did not find the right doc here. Cheers,
-
Personally I prefer to show it in containers. A container is not magic; just think of it as a tiny physical machine. Although in the demo every container runs on a single laptop, they have their own "isolated" processes, network, storage and memory, i.e. processes from one container can communicate with processes in another container only via the network. The word isolated is quoted because in the end all resources are shared, but it is still good enough for modelling.
Ok, now I've got it. Requirements for HA can be really different and it's clearly not possible to meet all of them. Somebody needs failure detection and failover in less than one second, somebody else wants to have zero data loss without synchronous replication. Obviously Patroni can't cover that and won't even target it.
There could be different failure situations:
First let's talk about network partitioning: the master that ends up on the losing side of the partition fails to refresh its leader lock and demotes itself to read-only, while the nodes that still see a quorum elect a new leader.
Patroni death on the master node is more tricky and the most interesting one, because postgres will continue to run as a master.
What is not covered so far? The case when Patroni runs on bare metal, but without Watchdog.
Haproxy periodically sends health-check probes to the Patroni REST API (on all nodes) to figure out where the master runs. If Patroni has died on the master, haproxy will exclude this node from load balancing. Basically it means the master will stop receiving traffic. Haproxy will also terminate all existing connections. The only thing which is not really covered, and could be potentially dangerous, is Patroni + Postgres + VIP. Here I can only suggest running Patroni under a supervisor script which restarts it if it dies.
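For reference, a minimal sketch of that haproxy side (hostnames and ports are assumptions; Patroni's REST API usually listens on 8008 and answers 200 only on the node currently running the master):

```bash
# Sketch of an haproxy listener that always routes clients to the current master.
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
listen postgres-master
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008
EOF
```

The `on-marked-down shutdown-sessions` option is what terminates existing connections as soon as a node stops answering the health check.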
-
I'm not sure what to do with this issue. Considering the question of @unixwitch, it seems to me the key point here is: how should PAF support recovery.conf generation, e.g. to update the primary_conninfo (implying a standby restart on a Pacemaker decision)?
Does that sum up the question?
-
(Sorry in advance for hijacking this issue with a Patroni discussion.) I'm back on the subject as I spent some days playing with Patroni. I didn't dig too far, but at least I can answer your messages 7 months later :) Most of your points come back to the fencing question. In other words, fencing means the other nodes kill the old master to put a known and confirmed state on a node which does not reply anymore, should it be because of network, load or angry unicorns. Here is an illustration: https://ourobengr.com/stonith-story/ I think Patroni is in pretty good shape. I have to study it a little bit to understand if the fencing part could be integrated into the callback logic. The only concern I have is about failing fencing: should the fencing fail, the callback must be able to cancel the election. I know this is counter-intuitive, but it is needed to avoid split brain. It seems to me repmgr callbacks are able to cancel a promotion; at least some of my colleagues already rely on this in some failing-fencing situations. @CyberDem0n, if you have some input or guidance, it would save me some time to get quickly to the goal with my limited bandwidth. Thanks in advance.
-
Let's assume that we have 3 data centers in different locations, DC1, DC2 and DC3, and the master is running in DC1. If DC1 becomes isolated, what Patroni does in this case: it fails to update the leader lock in Etcd (because it is on the losing side of the partitioned network) and restarts postgres in read-only mode.
One should be really careful with callbacks. They are executed asynchronously, and if you are using them to move a virtual IP or something similar there is a race condition: patroni/patroni#536. Although in practice we have never experienced it.
It looks like I still don't really understand your definition of fencing. Is it about connecting to the old node and killing patroni/postgres? Is it about restarting the node with the help of IPMI or something similar? Or, more generally: is it about accessing the old node via some network protocol and doing some actions in order to make sure that postgres can't be accessed after that? And what should happen when the network is broken?
-
Wide clusters are different from local clusters, because of the network obviously. But let's consider it. Everything must be redundant, even the WAN network between DCs; if this network is not redundant, you have a SPoF. But anyway, if your network is 100% down, you probably cannot fence the isolated node (unless you have a GSM backup network or similar), so you must rely on something similar to a quorum. In the Pacemaker galaxy, this is called the "Cluster Ticket Registry" (CTR) and it looks like the Patroni algorithm (but is quite a bit older). CTR can speed up the shutdown process if needed using some fencing method.
Argh, this is quite a problem. That means callbacks are not able to cancel a promotion, right? On a local cluster with only two nodes, fencing is mandatory. Not being able to confirm that the fencing succeeded means you are not safe. If fencing failed, the promotion must be aborted. Using repmgr, we add fencing in the promotion script and abort the promotion if needed. This is critical, as I consider repmgr safe only for two-node clusters... as long as fencing is set up. As far as I understand it, Patroni sounds fine for 2-node clusters only thanks to the quorum policy on the etcd side (which requires at least 3 nodes). But quorum is not a silver bullet either... In short, watchdog != quorum != fencing. A proper HA setup should have all three of them.
Fencing could be: powering the node off through its IPMI/BMC card, cutting its power through a manageable PDU or UPS, or isolating it at the network level so postgres cannot be reached anymore.
Again, most of them mean the network must be redundant, with no SPoF. A cluster usually cannot survive two distinct failures; this is a big rule. However, note that PDUs and UPSes can often be managed from their serial port, bypassing the network altogether. As Pacemaker supports multi-level fencing, we can rely on two fencing methods if needed. Back to Patroni: I really think you should consider a solution so an admin can abort the promotion based on some external logic. That would help to add some fencing logic to the process. Do you think this is possible? Would you accept an external contribution for such a feature? Thanks,
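As a side note, here is a minimal sketch of what such two-level fencing looks like on the Pacemaker side (device names, addresses and fence-agent parameters are assumptions and vary between agent versions):

```bash
# Level 1: power the node off through its IPMI/BMC interface.
pcs stonith create fence_srv1_ipmi fence_ipmilan \
    pcmk_host_list="srv1" ipaddr="10.0.0.101" login="admin" passwd="secret"

# Level 2: fall back to a managed PDU if IPMI does not answer.
pcs stonith create fence_srv1_pdu fence_apc_snmp \
    pcmk_host_list="srv1" ipaddr="10.0.0.201" port="3"

# Register both as fencing levels for the node: level 2 is only tried
# when level 1 fails.
pcs stonith level add 1 srv1 fence_srv1_ipmi
pcs stonith level add 2 srv1 fence_srv1_pdu
```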
-
Correct. They are asynchronous, and executed more or less right after the promotion is triggered.
If you are using some kind of load balancer it is not really necessary. Of course things change if you use a VIP.
Patroni solely relies on an external quorum (etcd, zookeeper, consul), therefore running just two postgres nodes is fine.
All right, got it. Unfortunately in most cases those things are not really controlled by the DBA :(
How it currently works: when the leader key disappears, the healthy standbys race for it; the node that acquires the leader lock in the DCS then promotes its postgres.
The improvement sounds fairly simple: call some external script which does fencing before calling promote. There is one minor problem here: what if it takes too long to run such a script?
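To make that idea concrete, a purely hypothetical sketch of such a pre-promotion fencing hook (the script name, fence command, timeout and exit-code convention are all assumptions, not an existing Patroni interface):

```bash
#!/bin/bash
# fence_old_master.sh <old_master>: called before promotion; a non-zero exit
# code would mean "fencing failed, abort the promotion".
old_master="$1"

# Try to power the old master off, but give up after a bounded delay so a
# hanging fence device cannot stall the election forever.
if timeout 30 fence_ipmilan --ip="bmc-${old_master}" \
        --username=admin --password=secret --action=off; then
    exit 0     # fencing confirmed, promotion may proceed
fi
exit 1         # fencing failed or timed out, promotion must be cancelled
```

The bounded timeout is one possible answer to the "what if it takes too long" concern: after 30 seconds the hook gives up and the promotion is cancelled rather than stalled indefinitely.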
Sure.
-
Load balancers can fail, and they do not update their routing rules exactly at the moment of the role transition. This opens many windows where connections can fail or be routed to the wrong node.
Yes, that was what I was stating as well. This is a really smart architecture.
Indeed. This is not a deal breaker in every situation.
Depends on the production constraints: service availability or data availability :) In short, either block/do nothing, or pray and keep going with the promotion. We usually block.
I'll ask some colleagues who know Python if this is feasible on their side.
-
Hi guys,
Jan
-
Yes, this is quite useful, indeed. You rely on the client TCP setup though: TCP retries and timeouts might delay the moment when the client reacts to the switchover/failover.
I already tried to implement A as well, but gave up because the node list format depends on the underlying cluster architecture (CMAN vs. coro 1.x vs. coro 2.x). I'll have to check your code for B; it could be imported upstream I suppose. @YanChii, did you check Pacemaker's CTR setup for geo clusters?
-
If you make a lot of small connections, it can slow down your system (especially if there's a non-responding node). But I hope this will soon be solved by pgbouncer or similar projects.
I know. My change works only on CentOS systems, but I think it's only a matter of one uname + if statement; we already know the output format we have to deal with.
That would be cool. But you'll probably want it enabled via some config var (not enabled by default), because it creates an explicit CRM config entry (the only way I was able to make the info node-independent and persistent across reboots).
I did. It is a very complicated setup. I would go for Patroni instead :). Jan
-
Hi Jan, Good to see you again :)
I don't like the idea for our purpose, since we don't control where the client will connect anymore. There is always the risk that someone starts an instance in R/W outside of Pacemaker (human error, or because reasons). In that case, we rely on the order of the hosts in the connection string to connect to the right host (or need to remember to modify the connection strings). Of course, if it's not possible to have a VIP... it's better to have this than nothing.
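For context, the connection-string fallback discussed here looks like libpq's multi-host support; a minimal sketch, assuming PostgreSQL 10+ clients and hostnames srv1..srv3:

```bash
# libpq tries the hosts in order and keeps only a connection that accepts writes,
# i.e. the current primary (read-only standbys are skipped).
psql "host=srv1,srv2,srv3 port=5432 dbname=app target_session_attrs=read-write"
```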
-
Hello,
Is it possible to use PAF without a virtual IP address - for example, to have recovery.conf generated based on the current master's hostname or IP address?
Our use-case for this is running on Google Compute Engine (GCE), where there is no layer 2 networking between hosts; getting a functional virtual IP is quite complex (it requires using the GCE API to move the address between hosts) and also error-prone (sometimes the API seems to fail). Since we don't need a virtual IP address for clients, it would be nice if we could avoid it for PAF as well.