Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testsuite: test restart of postgres #6539

Merged
merged 2 commits into from
Mar 5, 2024

Conversation

nadvornik
Copy link
Contributor

@nadvornik nadvornik commented Jan 27, 2023

What does this PR change?

This test restarts postgres service and verifies, that it is handled correctly in java and salt - applying highstate must work.

Links

This should catch the bugs discussed in https://github.com/SUSE/spacewalk/issues/20143

  • DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

@github-actions
Copy link
Contributor

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Oct 25, 2023
@Bischoff Bischoff self-requested a review October 25, 2023 10:46
Copy link
Contributor

@Bischoff Bischoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving, but please apply your suggested change (remove the Wait)

testsuite/features/secondary/srv_restart_postgres.feature Outdated Show resolved Hide resolved
Copy link
Contributor

@ktsamis ktsamis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Eric's suggestion is done, approved

Copy link
Member

@nodeg nodeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpicks regarding the wording and the PR needs a rebase.

testsuite/features/step_definitions/command_steps.rb Outdated Show resolved Hide resolved
testsuite/features/secondary/srv_restart_postgres.feature Outdated Show resolved Hide resolved
testsuite/features/secondary/srv_restart_postgres.feature Outdated Show resolved Hide resolved
@nodeg nodeg removed the Stale label Oct 25, 2023
@nodeg
Copy link
Member

nodeg commented Oct 25, 2023

Rebase done.

@nodeg
Copy link
Member

nodeg commented Oct 25, 2023

The scenario is definitely too fast:

Feature: Restart the PostgreSQL service

  Scenario: Restart the PostgreSQL database service                          # features/secondary/srv_restart_postgres.feature:6
      This scenario ran at: 2023-10-25 15:45:58 +0200
    Given I am authorized for the "Admin" section                            # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
    When I restart the "postgresql.service" service on "server"              # features/step_definitions/command_steps.rb:656
    And I wait until "postgresql" service is active on "server"              # features/step_definitions/command_steps.rb:240
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
    And I am on the Systems overview page of this "sle_minion"               # features/step_definitions/navigation_steps.rb:437
    And I follow "States" in the content area                                # features/step_definitions/navigation_steps.rb:326
    And I click on "Apply Highstate"                                         # features/step_definitions/navigation_steps.rb:283
    Then I should see a "Applying the highstate has been scheduled." text    # features/step_definitions/navigation_steps.rb:613
Timeout after 90 seconds (Timeout.timeout): I still see text 'Apply highstate scheduled by admin'
    And I wait until event "Apply highstate scheduled by admin" is completed # features/step_definitions/common_steps.rb:135
      execution expired (Timeout::Error)
      ./features/step_definitions/navigation_steps.rb:75:in `block (2 levels) in <top (required)>'
      ./features/support/commonlib.rb:94:in `block in repeat_until_timeout'
      ./features/support/commonlib.rb:83:in `repeat_until_timeout'
      ./features/step_definitions/navigation_steps.rb:72:in `/^I wait at most (\d+) seconds until I do not see "([^"]*)" text, refreshing the page$/'
      ./features/step_definitions/common_steps.rb:142:in `/^I wait (\d+) seconds until the event is picked up and (\d+) seconds until the event "([^"]*)" is completed$/'
      ./features/step_definitions/common_steps.rb:159:in `/^I wait at most (\d+) seconds until event "([^"]*)" is completed$/'
      ./features/step_definitions/common_steps.rb:136:in `/^I wait until event "([^"]*)" is completed$/'
      features/secondary/srv_restart_postgres.feature:14:in `I wait until event "Apply highstate scheduled by admin" is completed'
=> /var/log/rhn/rhn_web_ui.log

@github-actions
Copy link
Contributor

github-actions bot commented Oct 25, 2023

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (aprox. 1h), please be patient ☕
You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/6539/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/6539/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

@nodeg nodeg force-pushed the test-postgres branch 2 times, most recently from 7554195 to 9e03797 Compare October 26, 2023 11:50
@nodeg
Copy link
Member

nodeg commented Oct 26, 2023

I created a step for checking the taskomatic log as Oscar suggested.

  Scenario: Restart the PostgreSQL database service
    Given I am authorized for the "Admin" section
    When I restart the "postgresql.service" service on "server"
    And I wait until "postgresql" service is active on "server"
    And I am on the Systems overview page of this "sle_minion"
    And I follow "States" in the content area
    And I click on "Apply Highstate"
    Then I should see a "Applying the highstate has been scheduled." text
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"  # <------ new
    And I wait until event "Apply highstate scheduled by admin" is completed
When(/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/) do |text, file, host|
  node = get_target(host)
  repeat_until_timeout(message: "Entry #{text} in file #{file} on #{host} not found") do
    _output, code = node.run("tail -n 10 #{file} | grep '#{text}' ", check_errors: false, verbose: true)
    break if code.zero?
  end
end
(...)
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor'  returned status code = 1.
Output:
''
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
      Timeout after 250 seconds (repeat_until_timeout): Entry MinionActionExecutor in file /var/log/rhn/rhn_taskomatic_daemon.log on server not found (RuntimeError)
      ./features/support/commonlib.rb:100:in `block in repeat_until_timeout'
      ./features/support/commonlib.rb:83:in `repeat_until_timeout'
      ./features/step_definitions/command_steps.rb:1628:in `/^I wait until I see "([^"]*)" in file "([^"]*)" on "([^"]*)"$/'
      features/secondary/srv_restart_postgres.feature:15:in `I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"'
    And I wait until event "Apply highstate scheduled by admin" is completed                                   # features/step_definitions/common_steps.rb:135
=> /var/log/rhn/rhn_web_ui.log

However, when we do not wait a certain amount of time before we execute the high state on the minion, this event will not get picked up, and therefore I do not see what I am looking for in the taskomatic log file. So when applying highstate while taskomatik is not fully reconnected, it does not trigger the high state and leaves it in the pending queue.

image

Any suggestions or comments, @srbarrios?

@nodeg
Copy link
Member

nodeg commented Oct 26, 2023

If we leave the timeout in, the scenario succeeds:

dominik-ctl:~/spacewalk/testsuite # cucumber features/secondary/srv_restart_postgres.feature
(..)

  Scenario: Restart the PostgreSQL database service                                                            # features/secondary/srv_restart_postgres.feature:6
      This scenario ran at: 2023-10-26 15:12:58 +0200
    Given I am authorized for the "Admin" section                                                              # features/step_definitions/navigation_steps.rb:425
systemctl restart postgresql.service returned status code = 0.
Output:
''
    When I restart the "postgresql.service" service on "server"                                                # features/step_definitions/command_steps.rb:656
    And I wait until "postgresql" service is active on "server"                                                # features/step_definitions/command_steps.rb:240
    And I wait for "60" seconds                                                                                # features/step_definitions/common_steps.rb:19
Initializing a twopence node for 'sle_minion'.
Host 'sle_minion' is alive with determined hostname dominik-min-sles and FQDN dominik-min-sles.mgr.suse.de
Node: dominik-min-sles, OS Version: 15-SP4, Family: sles
    And I am on the Systems overview page of this "sle_minion"                                                 # features/step_definitions/navigation_steps.rb:437
    And I follow "States" in the content area                                                                  # features/step_definitions/navigation_steps.rb:326
    And I click on "Apply Highstate"                                                                           # features/step_definitions/navigation_steps.rb:283
    Then I should see a "Applying the highstate has been scheduled." text                                      # features/step_definitions/navigation_steps.rb:613
tail -n 10 /var/log/rhn/rhn_taskomatic_daemon.log | grep 'MinionActionExecutor'  returned status code = 0.
Output:
'2023-10-26 15:14:14,090 [DefaultQuartzScheduler_Worker-6] INFO  com.redhat.rhn.taskomatic.task.MinionActionExecutor - Executing action: 149
'
    And I wait until I see "MinionActionExecutor" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server" # features/step_definitions/command_steps.rb:1626
    And I wait until event "Apply highstate scheduled by admin" is completed                                   # features/step_definitions/common_steps.rb:135
      This scenario took: 114 seconds

1 scenario (1 passed)
10 steps (10 passed)
1m54.133s

@srbarrios
Copy link
Member

srbarrios commented Oct 26, 2023

Any suggestions or comments, @srbarrios?

Sorry took me a while to jump here.
Yeah, so I proposed to monitor the taskomatic log at another point of time in fact.
SO, something like:

  Scenario: Restart the PostgreSQL database service
    Given I am authorized for the "Admin" section
    When I restart the "postgresql.service" service on "server"
    And I wait until "postgresql" service is active on "server"
    And I wait until I see "xxxxx" in file "/var/log/rhn/rhn_taskomatic_daemon.log" on "server"
    And I am on the Systems overview page of this "sle_minion"
    And I follow "States" in the content area
    And I click on "Apply Highstate"
    Then I should see a "Applying the highstate has been scheduled." text
    And I wait until event "Apply highstate scheduled by admin" is completed

I tried to identify the text we need to look for by navigating into our product code, looking for tasko, quartz and postgresql words, but I couldn't find it. I would say we even need to add a new log to the Java side in order to properly monitor this reconnection from the database.

But... on python code, I found this debug log entry that can help with this monitoring:

https://github.com/uyuni-project/uyuni/blob/master/susemanager-utils/susemanager-sls/modules/pillar/suma_minion.py#L125

My guess is that's printed on salt-master logs

@nodeg
Copy link
Member

nodeg commented Oct 27, 2023

/var/log/salt/master also does not contain anything useful, we only the error messages.

dominik-srv:/ # tail /var/log/salt/master
2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:147 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly

2023-10-26 14:14:00,525 [salt.loaded.ext.engines.mgr_events:151 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 14:14:00,526 [salt.loaded.ext.engines.mgr_events:188 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 14:14:00,773 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR   ][27293] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)
2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:147 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: SSL connection has been closed unexpectedly

2023-10-26 15:14:14,467 [salt.loaded.ext.engines.mgr_events:151 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Error commiting: connection already closed
2023-10-26 15:14:14,469 [salt.loaded.ext.engines.mgr_events:188 ][ERROR   ][27280] salt.loaded.ext.engines.mgr_events: Diconnected from database. Trying to reconnect...
2023-10-26 15:14:14,717 [salt.loaded.ext.pillar.suma_minion:150 ][ERROR   ][27285] Error on getting database pillar, trying again: ('SSL connection has been closed unexpectedly\n',)

We concluded in Slack that there is no easy solution for this right now, so I added the initial waiting time from Vladimir back but also introduced a step to check for the triggered high state event in the Taskomatik log file.

@nodeg nodeg requested a review from srbarrios October 27, 2023 09:08
Copy link
Member

@srbarrios srbarrios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. We couldn't find something better, so we will need to have this sleep :(

@nodeg
Copy link
Member

nodeg commented Jan 18, 2024

We need to disable this test for now, according to mc. It did cause too much issues in 4.3 where it already was merged. I will open a bug report and adjust the PR to comment the test out in the .yml file.

@nodeg
Copy link
Member

nodeg commented Jan 18, 2024

@raulillo82 This can also be merged at some point to align the branches with 4.3, too. It is not exactly a test suite fix, though. I did rebase the branch again.

@nodeg nodeg assigned nodeg and unassigned nodeg Mar 1, 2024
@nodeg
Copy link
Member

nodeg commented Mar 5, 2024

Rebase done. Once the tests are finished, I will merge the PR.

@nodeg nodeg merged commit 09c6163 into uyuni-project:master Mar 5, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants