10 seconds timeout for k8spm #257

Merged 1 commit on Mar 28, 2024

plasorak
Contributor

To test this, get the latest nightly and check out the production/v4 branch of listrev. Then:

$ cat >lr.json <<EOL
{
    "boot":{
        "k8s_image": "ghcr.io/dune-daq/alma9-run:develop",
        "process_manager": "k8s",
        "ers_impl":"cern",
        "opmon_impl":"cern",
        "use_connectivity_service": false,
        "start_connectivity_service":false
    }
}
EOL

$ listrev_gen  -c lr.json lr
$ scale_listrev_app --num-apps 100  lr
$ nanorc --pm k8s://np04-srv-016:31000 lr session-name
# ... start the run etc.

It also passed the minimal_system_quick_test integration tests.

@bieryAtFnal added the "miscellaneous deliverable" label (a change that is/will be part of a release but is not substantial enough to be a daq-deliverable) on Mar 26, 2024
@plasorak
Contributor Author

It was pointed out that this creates "listrev application bombs" on the whole cluster, so before starting the run, make sure to select the nodes that won't interfere with data taking.
Before executing scale_listrev_app, add the following in your boot.json:

{
    "apps": {
        "listrev-app-s-0": {
            "...",
            "node-selection": [
                {
                    "kubernetes.io/hostname": [
                        "np02-srv-001",
                        "np02-srv-003",
                        "np02-srv-004",
                        "np04-srv-011",
                        "np04-srv-012",
                        "np04-srv-013",
                        "np04-srv-015",
                        "np04-srv-018",
                        "np04-srv-019",
                        "np04-srv-024",
                        "np04-srv-031"
                    ],
                    "strict": true
                }
            ]
        }
    }
}

and make sure that the boot.json has this snippet for each app after running scale_listrev_app.
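
A minimal Python sketch of how one might apply and check that snippet across all apps, rather than editing each entry by hand. It assumes only the layout shown above (a top-level "apps" object whose entries each take a "node-selection" list). The function names and the shortened node list are hypothetical and are not part of listrev or nanorc.

```python
import copy

# Hypothetical, shortened node list: replace it with the nodes that are
# safe to use under the current data-taking conditions.
NODE_SELECTION = [
    {
        "kubernetes.io/hostname": ["np02-srv-001", "np02-srv-003"],
        "strict": True,
    }
]


def apply_node_selection(boot, node_selection):
    """Return a copy of `boot` with "node-selection" set on every app."""
    patched = copy.deepcopy(boot)
    for app in patched.get("apps", {}).values():
        # Each app gets its own copy so later edits do not leak between apps.
        app["node-selection"] = copy.deepcopy(node_selection)
    return patched


def apps_missing_node_selection(boot):
    """Return the sorted names of apps with no (or an empty) "node-selection"."""
    return sorted(
        name
        for name, app in boot.get("apps", {}).items()
        if not app.get("node-selection")
    )


# In-memory example standing in for the content of a boot.json file.
example_boot = {"apps": {"listrev-app-s-0": {}, "listrev-app-s-1": {}}}
patched_boot = apply_node_selection(example_boot, NODE_SELECTION)
unpatched_apps = apps_missing_node_selection(patched_boot)
```

For a real file, load it with `json.load`, pass the resulting dict through `apply_node_selection`, and write it back with `json.dump`. Running `apps_missing_node_selection` on the file after scale_listrev_app gives a quick check that no app was left unpinned.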

@TiagoTAlves
Contributor

Should we just add it to the script? @plasorak

@plasorak
Contributor Author

I'd say no, as this snippet is np04-specific and the node list will likely change with the data-taking conditions. It is meant for the current testing at np04 and will likely not be valid if and when we need to run these tests later on.

@TiagoTAlves TiagoTAlves self-assigned this Mar 27, 2024
@TiagoTAlves TiagoTAlves self-requested a review March 27, 2024 16:53
Contributor

@TiagoTAlves TiagoTAlves left a comment

This has now been tested with 100 listrev applications in Kubernetes, cycling through the commands, with no failures observed so far. I will next test with an actual k8s configuration, cycling through all of the commands multiple times to try to provoke failures.

Will test this before approving

@TiagoTAlves TiagoTAlves merged commit 9d0a19f into production/v4 Mar 28, 2024
1 check passed
@TiagoTAlves TiagoTAlves deleted the plasorak/timeouts branch March 28, 2024 18:03