Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix critical services issue on pg follower restart #8548

Merged
merged 12 commits into from
Sep 13, 2024

Conversation

vivekshankar1
Copy link
Collaborator

@vivekshankar1 vivekshankar1 commented Aug 22, 2024

directly send data pg leader from pg-gateway

🔩 Description: What code changed, and why?

  • pg gateway changes so that it directly talks to postgres instead of haproxy
  • health check retry for erchef and nginx
  • pinned curl version in automate-cs-nginx
  • deployment service config changes to now support health check interval and services list. After patching we need to manually stop and start automate for changes to reflect. Once changes has been applied, even if you start more it will persist.
  • input.tf change 7432 to 5432 (config.toml.erb)
  • made fail_timeout configurable
  • Add config changes in dev default-config

https://chefio.atlassian.net/browse/CHEF-15716
https://chefio.atlassian.net/browse/CHEF-15717
https://chefio.atlassian.net/browse/CHEF-15718
https://chefio.atlassian.net/browse/CHEF-15719

CHEF-15716_MAKE_CONFIGURABLE_FAIL_TIMEOUT_ES_GATEWAY.mp4
CHEF-15717_HEALTH_CHECK_INTERVAL_CONFIGURABLE.mp4
CHEF-15718_OPENSEARCH_REBOOT_NO_FE_SERVICE_CRITICAL.mov
CHEF-15719_PG_FOLOWER_REBOOT_FE_SERVICE_NO_CRITICAL.mov

Upgrade

⛓️ Related Resources

👍 Definition of Done

👟 How to Build and Test the Change

✅ Checklist

All PRs must tick these:

With occasional exceptions, all PRs from Progress employees must tick these:

  • Is the code clear? (complicated code or lots of comments--subdivide and use well-named methods, meaningful variable names, etc.)
  • Consistency checked? (user notifications, user prompts, visual patterns, code patterns, variable names)
  • Repeated code blocks eliminated? (adapt and reuse existing components, blocks, functions, etc.)
  • Spelling, grammar, typos checked? (at a minimum use make spell in any component directory)
  • Code well-formatted? (indents, line breaks, etc. improve rather than hinder readability)

All PRs from Progress employees should tick these if appropriate:

  • Tests added/updated? (all new code needs new tests)
  • Docs added/updated? (all customer-facing changes)

Please add a note next to any checkbox above if you are NOT ticking it.

📷 Screenshots, if applicable

Copy link

netlify bot commented Aug 22, 2024

👷 Deploy Preview for chef-automate processing.

Name Link
🔨 Latest commit b96c4c9
🔍 Latest deploy log https://app.netlify.com/sites/chef-automate/deploys/66e4322443f4f10008df4265

@vivekshankar1 vivekshankar1 force-pushed the vs/directly-connect-to-pg-ha-from-pg-gateway branch from 49e8b24 to 5df37cf Compare September 6, 2024 13:06
@vivekshankar1 vivekshankar1 changed the title fix critical services issue on pg follower restart [WIP] fix critical services issue on pg follower restart Sep 10, 2024
components/automate-cs-nginx/habitat/plan.sh Outdated Show resolved Hide resolved
components/automate-cs-nginx/habitat/plan.sh Outdated Show resolved Hide resolved
exit 2
fi

for i in $(seq 1 5);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we suppose to add sleep time in between the calls?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
{{~#each cfg.service.parsed_nodes as |node|}}
{{~#if node.is_domain}}
{{~#if ../cfg.resolvers.nameservers }}
server-template {{node.address}} 8 {{node.address}}:{{node.port}} check resolvers pgdns init-addr none resolve-prefer ipv4
{{else}}
server {{node.address}} {{node.address}}:{{node.port}} check
server {{node.address}} {{node.address}}:{{node.port}} maxconn 350 check port 6432
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we consider this value -> maxconn 350
earlier this is the part of backend ha proxy and its value is 350. And there are 3 HAproxy at backend.
Now the pg-gateway on each FE will have this value, does we need to increase the number of PG connection on the backend package ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increased default max_connection to 1500

vivekshankar1 and others added 10 commits September 13, 2024 16:19
…a to pg leader from pg-gateway

Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: punitmundra <pmundra@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: Vivek Shankar <vshankar@progress.com>
@vivekshankar1 vivekshankar1 force-pushed the vs/directly-connect-to-pg-ha-from-pg-gateway branch from 6829532 to 7e4ea92 Compare September 13, 2024 10:52
@punitmundra punitmundra changed the title [WIP] fix critical services issue on pg follower restart fix critical services issue on pg follower restart Sep 13, 2024
Signed-off-by: Vivek Shankar <vshankar@progress.com>

Quality Gate failed Quality Gate failed

Failed conditions
55.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube

@@ -89,6 +89,9 @@ shared_buffers = "1GB"
username = "admin"
name = "Local Administrator"
password = "chefautomate"
[deployment.v1.svc.health]
health_check_interval=31
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this 31 ?

Signed-off-by: Vivek Shankar <vshankar@progress.com>
@punitmundra punitmundra merged commit 34eb900 into main Sep 13, 2024
9 of 12 checks passed
@punitmundra punitmundra deleted the vs/directly-connect-to-pg-ha-from-pg-gateway branch September 13, 2024 12:41
kalroy pushed a commit that referenced this pull request Oct 8, 2024
* fix critical services issue on pg follower restart. Directly send data to pg leader from pg-gateway

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* added health check retry

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* updated working patch health check config for provided services

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* renamed folder

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* review comments

Signed-off-by: punitmundra <pmundra@progress.com>

* fail timout added in es, changes port from 7432 to 5432

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* added log in cs health hook

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* path after upgrade

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* added max conn configurable

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* fix policy

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* fix policy

Signed-off-by: Vivek Shankar <vshankar@progress.com>

* revert pg

Signed-off-by: Vivek Shankar <vshankar@progress.com>

---------

Signed-off-by: Vivek Shankar <vshankar@progress.com>
Signed-off-by: punitmundra <pmundra@progress.com>
Co-authored-by: punitmundra <pmundra@progress.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants