Fix random test failures in github #939

Merged
merged 5 commits into from
Jun 22, 2023

nirs
Member

@nirs nirs commented Jun 22, 2023

Recently we have seen random failures in drenv tests when starting the VM. Try to debug
and avoid the failures.

nirs added 2 commits June 22, 2023 18:44
We see many failures when starting a podman-based environment on GitHub.
Add verbose logging to get more info about the failures.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used a 64-bit random number, which should be enough, but we see random
failures on GitHub with this error:

    time="2023-06-22T15:04:59Z" level=warning msg="Error validating CNI
    config file /etc/cni/net.d/test-77836bdde14be1d8-cluster.conflist:
    [plugin bridge does not support config version \"1.0.0\" plugin
    portmap does not support config version \"1.0.0\" plugin firewall
    does not support config version \"1.0.0\" plugin tuning does not
    support config version \"1.0.0\"]" Error: volume with name
    test-77836bdde14be1d8-cluster already exists: volume already exists

This is probably not the reason for "volume already exists", but let's
make sure it cannot be the problem by using a 128-bit random prefix.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
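
For illustration, a 128-bit random prefix can be generated with Python's standard `secrets` module. This is a minimal sketch, not the code from the commit: the helper name `random_profile_prefix` and the `test-<hex>-` format are assumptions modeled on the profile name `test-77836bdde14be1d8-cluster` seen in the log above, whose 16 hex digits correspond to the old 64-bit value.

```python
import secrets


def random_profile_prefix(bits=128):
    """Return a prefix like 'test-<hex>-' carrying `bits` of randomness.

    Hypothetical helper: the name and the prefix format are assumptions,
    not taken from the drenv source.
    """
    if bits <= 0 or bits % 8:
        raise ValueError("bits must be a positive multiple of 8")
    # token_hex(n) returns 2*n hex digits built from n random bytes.
    return f"test-{secrets.token_hex(bits // 8)}-"


# A 128-bit prefix yields 32 hex digits; 64 bits yields 16.
profile = random_profile_prefix() + "cluster"
print(profile)
```

Even with 64 bits, an accidental clash between two test runs is extremely unlikely, which is consistent with the commit message treating the wider prefix as a precaution rather than a confirmed fix.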
nirs added 2 commits June 22, 2023 19:14
To report a bug in minikube we need to use the --alsologtostderr option.
Enable this option when using debug mode.

With this option minikube logs a huge amount of debug output to stderr,
but we will see it only if the command fails, in the commands.Error
traceback.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
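
The idea of surfacing stderr only on failure can be sketched as follows. This is an assumption-laden illustration, not drenv's actual `commands` module: the `Error` class and `run()` function below are hypothetical stand-ins that capture the child's stderr and attach it to the exception raised when the command exits with a non-zero status.

```python
import subprocess
import sys


class Error(Exception):
    """Hypothetical stand-in for an error that carries the command's stderr."""

    def __init__(self, command, exitcode, stderr):
        super().__init__(command, exitcode, stderr)
        self.command = command
        self.exitcode = exitcode
        self.stderr = stderr

    def __str__(self):
        return (
            f"Command failed: {list(self.command)}\n"
            f"exitcode: {self.exitcode}\n"
            f"stderr:\n{self.stderr}"
        )


def run(*args):
    """Run a command and return its stdout; stderr is reported only on failure."""
    cp = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if cp.returncode != 0:
        raise Error(args, cp.returncode, cp.stderr)
    return cp.stdout


# Usage: the child's stderr becomes visible only because the command failed.
try:
    run(sys.executable, "-c", "import sys; sys.exit('simulated failure')")
except Error as e:
    print(e)
```

With this shape, a successful command stays quiet no matter how much it writes to stderr, while a failing one reports the full debug output in the error.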
Log the minikube commands at debug level to make it easier to understand
failures on remote machines and verify changes in minikube commands.

Example of the new logs:

    $ drenv start envs/test.yaml -v
    2023-06-22 19:25:07,827 INFO    [test] Starting environment
    2023-06-22 19:25:07,920 INFO    [cluster] Starting minikube cluster
    2023-06-22 19:25:07,920 DEBUG   [cluster] Running ['minikube', 'start', '--profile', 'cluster',
    '--driver', 'podman', '--container-runtime', 'cri-o', '--disk-size', '20g', '--nodes', '1',
    '--cni', 'auto', '--cpus', '2', '--memory', '2g', '--alsologtostderr']
    ...

    $ drenv delete envs/test.yaml -v
    2023-06-22 19:26:05,932 INFO    [test] Deleting environment
    2023-06-22 19:26:05,933 INFO    [cluster] Deleting cluster
    2023-06-22 19:26:05,933 DEBUG   [cluster] Running ['minikube', 'delete', '--profile', 'cluster']
    ...

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
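
A minimal sketch of the two changes above, adding `--alsologtostderr` only in debug mode and logging the command at debug level, might look like this. The function names `start_command` and `log_command`, the reduced option set, and the log format are assumptions for illustration; only the minikube flags shown in the example logs are used.

```python
import logging


def start_command(profile, driver="podman", verbose=False):
    """Build a minikube start command (hypothetical helper, subset of options)."""
    cmd = ["minikube", "start", "--profile", profile, "--driver", driver]
    if verbose:
        # Required when reporting minikube bugs; very noisy, so debug mode only.
        cmd.append("--alsologtostderr")
    return cmd


def log_command(profile, cmd):
    """Log the exact argument list so it can be inspected or reproduced."""
    logging.debug("[%s] Running %s", profile, cmd)


# The format is an assumption chosen to resemble the example logs.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)-7s %(message)s",
)

cmd = start_command("cluster", verbose=True)
log_command("cluster", cmd)
```

Logging the argument list rather than a joined string keeps quoting unambiguous when an option value contains spaces.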
@nirs nirs force-pushed the tmpenv branch 3 times, most recently from 97a3c38 to d7ff7cb Compare June 22, 2023 17:24
@nirs nirs changed the title Try to fix random test failures in github Fix random test failures in github Jun 22, 2023
@nirs nirs marked this pull request as ready for review June 22, 2023 17:29
Member

@raghavendra-talur raghavendra-talur left a comment

Yes, this change is good but I don't think a clash of the random number was responsible for the error message. Very likely that the test created the volume twice.

@@ -37,21 +37,23 @@ environment.
[Install clusteradm CLI tool](https://open-cluster-management.io/getting-started/installation/start-the-control-plane/#install-clusteradm-cli-tool)
for the details. Version 0.5.0 or later is required.

1. Install `podman`
1. Install `docker`
Member

I think this is a pretty big change and is claiming incompatibility with podman. I suggest that you update the commit message with example issues that you have seen with podman. If possible, also link to one or more runs on github actions where the failures have occurred.

Member Author

podman is considered experimental in minikube:
https://minikube.sigs.k8s.io/docs/drivers/podman/

We used it because it was a way to get rid of docker, and it seemed to work, but
since we added it, it was removed from all environments. We kept it only in the test
env used for running the tests locally and on GitHub. Now that it does not work
on GitHub there is no reason to use it.

We can consider using podman again when it is officially supported and works
for us in github.

If you look at the latest GitHub Actions runs from this week you will find that all of
them failed in drenv tests in the same way:

@raghavendra-talur
Member

@nirs If you are confident about the changes, then we don't need any change in the code; but please provide references to the failures such that podman can be enabled for drenv tests later by verifying that the issues are resolved.

The podman driver is flaky locally and consistently failing in github
actions now. Switching to docker fixes the issues in github.

Developers need to install docker to run drenv tests locally. I tried to
use the podman-docker package but it does not emulate docker well
enough for minikube.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs
Member Author

nirs commented Jun 22, 2023

@nirs If you are confident about the changes, then we don't need any change in the code;

All the changes are useful regardless of replacing podman with docker for the drenv
tests. They make the code easier to debug next time, and eliminate possible sources
of failure.

but please provide references to the failures such that podman can be enabled for drenv tests later by verifying that the issues are resolved.

Provided in the other comment.

@nirs
Member Author

nirs commented Jun 22, 2023

Yes, this change is good but I don't think a clash of the random number was responsible for the error message. Very likely that the test created the volume twice.

There is no chance that the volume was created by the test: the only place calling drenv
with this profile pattern (test-xxx-cluster) is in conftest.py, and this code is
called exactly once by pytest when we start the tests.

Even if we assume that pytest is buggy and this is called more than once, starting
the minikube profile twice should succeed, since minikube reuses the existing VM.

I think this is an issue with the podman driver, maybe only on Ubuntu since locally
this runs fine, but we don't have much control over the github actions environment.

The most important thing now is to unbreak our CI. Once the CI is green again we have
time to investigate and experiment with different Ubuntu or podman versions or
other changes.

I filed a minikube bug for this issue; hopefully they will have some ideas on how to
make podman work again:
kubernetes/minikube#16755

@raghavendra-talur raghavendra-talur merged commit df0d819 into RamenDR:main Jun 22, 2023
@nirs nirs self-assigned this Jun 23, 2023
@nirs nirs added the test Testing related issue label Jun 23, 2023
@nirs nirs deleted the tmpenv branch September 7, 2023 23:28
Labels
test Testing related issue
Projects
Status: Done
2 participants