Initial version of the configuration autodetection script #240

Open
wants to merge 39 commits into
base: alps
Changes from 38 commits
f5ce0c4
Initial version of the configuration autodetection
Oct 17, 2024
db5919e
Fix bug in user input check
Oct 18, 2024
6a22ca5
Fix bugs
Oct 18, 2024
e0179a0
Fix bugs/add jinja template for .py
Oct 18, 2024
1a1fa89
Add README.md
Oct 18, 2024
9bc2842
Restructure for job submission and command line options
Oct 23, 2024
474d667
Fix end line
Oct 23, 2024
1eb3865
Add features in README.md
Oct 23, 2024
284147d
Fix bugs in reservations (auto)
Oct 23, 2024
f1f9756
Fix bugs / add containers in features
Oct 23, 2024
9676417
Address PR comments
Oct 24, 2024
ec5b84b
Comment line
Oct 24, 2024
56ec0af
Add features exclusion / Fix bugs
Oct 24, 2024
b4abc32
Fix device auto detection
Oct 24, 2024
5157e89
Fix line spaces
Oct 24, 2024
1cad3cf
Fix detection of devices with Gres
Oct 24, 2024
25d4ada
Fix device count from Slurm
Oct 24, 2024
a4a53b2
Improve GPU detection (only NVIDIA) / Add partition in case failed jo…
Oct 28, 2024
c7cc3e6
Improve Gres detection for gpus
Oct 29, 2024
7917e3e
Add container detection with tmod
Oct 29, 2024
9225349
Fix syntax error
Oct 29, 2024
680257f
Add AMD GPUs detection
Oct 29, 2024
ac93ac1
Fix partitions access
Oct 29, 2024
888bb2b
Refactor code
Nov 5, 2024
b5906c0
Fix last line
Nov 5, 2024
a490557
Add time out policy
Nov 5, 2024
eccdc7a
Fix lmod detection
Nov 5, 2024
fe0228d
Fix lmod detection
Nov 5, 2024
37f2f4e
Add access command line option
Nov 6, 2024
225df63
Fix access options for eiger
Nov 6, 2024
5dbb4ec
Update README.md
Nov 6, 2024
501f6ec
Fix avail in lmod
Nov 7, 2024
277c8e1
Fix access options
Nov 8, 2024
663edc6
Added GH200 model
Nov 8, 2024
d229b42
Add devices model to gpu
Nov 8, 2024
2017292
Fix asyncio for python >3.7
Nov 8, 2024
7727595
Fix format/address PR comments
Nov 8, 2024
7cd68ad
Add verbosity control and formatting
Nov 12, 2024
d7f0163
Improve -v specifications
Nov 13, 2024
94 changes: 94 additions & 0 deletions config/README.md
@@ -0,0 +1,94 @@
# Configuration autodetection for new systems

The ReFrame configuration file can be automatically generated for a new system using ```generate.py```.

## Features

- Detection of system name
- Detection of hostname
- Detection of module system
- Detection of scheduler
- Detection of parallel launcher
- Detection of partitions based on node types (node features from Slurm) [only when the scheduler is **Slurm**]
- Detection of partitions based on reservations [only when the scheduler is **Slurm**]
- Detection of available container platforms in remote partitions (and required modules when the modules system is ```lmod``` or ```tmod```) [only when the scheduler is **Slurm**]
- Detection of devices with architecture in the nodes (GRes from Slurm) [only when the scheduler is **Slurm**]

## Usage

### Install Jinja2 and autopep8 python packages

```sh
pip install jinja2
pip install autopep8
```

### Basic usage

```sh
python3 generate.py
```

By default the script runs in **interactive** mode: user input is used to detect and generate the final configuration of the system. The user input can be suppressed by passing the ```--auto``` option.

## Available Arguments

| Argument | Description |
|-----------------------------------------------------|-----------------------------------------------------------------------------------------|
| `--auto` | Disables user input. |
| `--exclude=[list_of_features]` | List of features to be excluded in the detection of node types |
| `--no-remote-containers` | Disables the detection of containers in the remote partitions |
| `--no-remote-devices` | Disables the detection of devices (slurm GRes) in the remote partitions |
| `--reservations=[list_reservations]` | Specifies the reservations in the system for which partitions should be created |
| `--prefix` | Shared directory where the jobs for remote detection will be created and submitted |
| `--access` | Additional access options that must be included for the sbatch submission |
| `-v` | Adjust the verbosity level to debug in ```auto``` mode |

```sh
python3 generate.py --auto
```

With this option, user input is not required to generate the configuration file. In the ```auto``` mode the following partitions are automatically created:

- Login partition
- Partition for each node type (based on Slurm AvailableFeatures)

If additional partitions for specific reservations are required, the ```--auto``` option can be combined with ```--reservations=reserv_1,reserv_2``` in order to create partitions for ```reserv_1``` and ```reserv_2``` respectively.

```sh
python3 generate.py --auto --reservations=reserv_1,reserv_2
```

In the ```auto``` mode the detection of container platforms and devices is enabled by default. This requires submitting one job per partition to detect these features, and the script waits until each job has completed. The job submission can be disabled with the options ```--no-remote-containers``` and ```--no-remote-devices``` respectively. Note that, by default, no device detection job is submitted for a node in which no GRes is detected.

The options ```--no-remote-containers``` and ```--reservations=[list_reservations]``` are only used in the ```auto``` mode. The option ```--no-remote-devices``` is valid for both interactive and ```auto``` modes.
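
Since device detection relies on the GRes reported by Slurm, the following minimal sketch shows how a GRes string could be turned into device entries. It is not taken from the script: ```parse_gres``` is a hypothetical helper, and the assumed input format is the usual ```name[:type][:count]``` form, optionally followed by a parenthesised annotation such as ```(S:0-1)```.

```python
import re


def parse_gres(gres):
    """Parse a Slurm GRes string into a list of device dictionaries.

    Hypothetical helper for illustration; assumes 'name[:type][:count]'
    entries separated by commas, e.g. 'gpu:a100:4(S:0-1)'.
    """
    devices = []
    if not gres or gres == '(null)':
        return devices

    # Drop parenthesised annotations such as '(S:0-1)' before splitting
    cleaned = re.sub(r'\([^)]*\)', '', gres)
    for entry in cleaned.split(','):
        fields = [f for f in entry.strip().split(':') if f]
        if not fields:
            continue

        dev_type, model, count = fields[0], None, 1
        rest = fields[1:]
        # A trailing numeric field is taken as the device count
        if rest and rest[-1].isdigit():
            count = int(rest.pop())

        # Whatever remains is taken as the device model
        if rest:
            model = rest[0]

        devices.append(
            {'type': dev_type, 'model': model, 'num_devices': count}
        )

    return devices


# An empty result would mean that no device detection job is needed
print(parse_gres('gpu:a100:4(S:0-1)'))
print(parse_gres('(null)'))
```

With a helper like this, an empty list corresponds to the case where no device detection job is submitted for the node.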

**Excluding node features from the node types filtering**

In order to exclude some features from the detection of the different node types, these can be passed to the script in the command line using the option ```--exclude=[list_of_features]```. Patterns can also be specified in this option using ```*```.

*Usage example:*

```sh
python3 generate.py --exclude=group*,c*,r*
```

Running this will ignore the node features that match those patterns. Node A with features ```(gpu,group2,column43,row9)``` and Node B ```(gpu,group8,column1,row75)``` will be identified as the same node type and included in the same partition.
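
The grouping described above can be sketched with the standard ```fnmatch``` module. This is only an illustration of the idea, under the assumption that patterns follow shell-style globbing; ```node_type``` is a hypothetical helper and not a function of the script.

```python
from fnmatch import fnmatchcase


def node_type(features, exclude_patterns):
    """Return the features that define a node type as a sorted tuple.

    Hypothetical helper: features matching any exclusion pattern are
    dropped, so nodes that differ only in those features compare equal.
    """
    return tuple(sorted(
        feat for feat in features
        if not any(fnmatchcase(feat, pat) for pat in exclude_patterns)
    ))


exclude = 'group*,c*,r*'.split(',')
node_a = ['gpu', 'group2', 'column43', 'row9']
node_b = ['gpu', 'group8', 'column1', 'row75']

# Both nodes reduce to the same node type once the patterns are applied
print(node_type(node_a, exclude) == node_type(node_b, exclude))
```

Note that broad patterns such as ```c*``` also remove any other feature starting with ```c```, so the patterns should be chosen with the full feature list of the system in mind.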

**Specifying additional access options**

Additional access options can be passed to the script through the ```--access``` command line option.

*Usage example:*

```sh
python3 generate.py --access=-Cgpu
```

This option will add ```-Cgpu``` to the access options for the remote partitions in the configuration file and use it to submit the remote detection jobs for container platforms and devices.
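
As a rough sketch of how such a value travels from the command line to a partition entry, the snippet below uses a stand-in parser that defines only ```--access``` and splits the value on commas, as ```generate.py``` does. The partition name, the second access option, the job script name and the shape of the ```sbatch``` command are assumptions made for illustration.

```python
import argparse

# Stand-in for the parser in generate.py; only '--access' is defined here
parser = argparse.ArgumentParser()
parser.add_argument('--access', action='store')

# Several options can be given as a comma-separated list (values are
# hypothetical examples)
args = parser.parse_args(['--access=-Cgpu,-Aproj'])
access_opt = args.access.split(',') if args.access else ''

# Hypothetical partition entry carrying the access options
partition = {
    'name': 'gpu_nodes',
    'scheduler': 'slurm',
    'access': access_opt,
}

# Assumed shape of the submission command for a remote detection job
sbatch_cmd = ['sbatch', *partition['access'], 'detect_devices.sh']
print(sbatch_cmd)
```

The ```--access=...``` form with an equals sign is used on purpose: because the value starts with a dash, passing it as a separate token (```--access -Cgpu```) would normally be interpreted by ```argparse``` as another option.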

## Generated configuration files

The script generates a ```py``` file with the system configuration:

- .py: ```<system_name>```_config.py
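
The generated template leaves ```'environments'``` empty, because environments cannot be detected automatically. A small check along the following lines could be used to see what still has to be filled in by hand. This is a sketch only: both helper functions are hypothetical, and the assumption is that the generated file defines a module-level ```site_configuration``` dictionary as in the template.

```python
import importlib.util


def load_site_configuration(path):
    """Load 'site_configuration' from a generated configuration file."""
    spec = importlib.util.spec_from_file_location('generated_config', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.site_configuration


def missing_pieces(site_configuration):
    """List the parts of the configuration that still need manual input."""
    problems = []
    if not site_configuration.get('environments'):
        problems.append('no environments defined')

    for system in site_configuration.get('systems', []):
        for part in system.get('partitions', []):
            if not part.get('environs'):
                problems.append(
                    f"partition {part['name']!r} has no environs"
                )

    return problems


# Example usage (the file name is a placeholder):
# for problem in missing_pieces(load_site_configuration('mysystem_config.py')):
#     print(problem)
```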
137 changes: 137 additions & 0 deletions config/generate.py
@@ -0,0 +1,137 @@
# Copyright 2024 Swiss National Supercomputing Centre (CSCS/ETH Zurich)
# ReFrame Project Developers. See the top-level LICENSE file for details.
#
# SPDX-License-Identifier: BSD-3-Clause

import argparse
import autopep8
import os
from jinja2 import Environment, FileSystemLoader
from utilities.config import SystemConfig
from utilities.io import getlogger, set_logger_level

JINJA2_TEMPLATE = 'reframe_config_template.j2'


def main(user_input, containers_search, devices_search, reservs,
         exclude_feats, access_opt, tmp_dir):

    # Initialize system configuration
    system_info = SystemConfig()
    # Build the configuration with the right options
    system_info.build_config(
        user_input=user_input, detect_containers=containers_search,
        detect_devices=devices_search, exclude_feats=exclude_feats,
        reservs=reservs, access_opt=access_opt, tmp_dir=tmp_dir
    )

    # Set up Jinja2 environment and load the template
    template_loader = FileSystemLoader(searchpath='.')
    env = Environment(loader=template_loader,
                      trim_blocks=True, lstrip_blocks=True)
    rfm_config_template = env.get_template(JINJA2_TEMPLATE)

    system_info_jinja = system_info.format_for_jinja()
    # Render the template with the gathered information
    organized_config = rfm_config_template.render(system_info_jinja)

    # Output filename for the generated configuration
    output_filename = f'{system_info.systemname}_config.py'

    # Format the content
    formatted = autopep8.fix_code(organized_config)

    # Overwrite the file with formatted content
    with open(output_filename, "w") as output_file:
        output_file.write(formatted)

    getlogger().info(
        f'\nThe following configuration file was created:\n'
        f'PYTHON: {output_filename}', color=False
    )


if __name__ == '__main__':

    # Create an ArgumentParser object
    parser = argparse.ArgumentParser()

    # Define the '--auto' flag
    parser.add_argument('--auto', action='store_true',
                        help='Turn off interactive mode')
    # Define the '--no-remote-containers' flag
    parser.add_argument(
        '--no-remote-containers', action='store_true',
        help='Disable container platform detection in remote partition'
    )
    # Define the '--no-remote-devices' flag
    parser.add_argument('--no-remote-devices', action='store_true',
                        help='Disable devices detection in remote partition')
    # Define the '--reservations' flag
    parser.add_argument(
        '--reservations', nargs='?',
        help='Specify the reservations that you want to create partitions for.'
    )
    # Define the '--exclude' flag
    parser.add_argument(
        '--exclude', nargs='?',
        help='Exclude certain node features from the detection ' +
             'of node types'
    )
    # Define the '--prefix' flag
    parser.add_argument(
        '--prefix', action='store',
        help='Shared directory for remote detection jobs'
    )
    # Define the '--access' flag
    parser.add_argument(
        '--access', action='store',
        help='Compulsory options for accessing remote nodes with sbatch'
    )
    # Define the '-v' flag
    parser.add_argument(
        '-v', action='store_true',
        help='Set the verbosity to debug for the auto mode'
    )

    args = parser.parse_args()

    # Interactive mode is the default; '--auto' turns it off
    user_input = not args.auto

    # Remote detection is enabled unless explicitly disabled
    containers_search = not args.no_remote_containers
    devices_search = not args.no_remote_devices

    if args.reservations:
        reservs = args.reservations.split(',')
    else:
        reservs = False

    if args.exclude:
        exclude_feats = args.exclude.split(',')
    else:
        exclude_feats = []

    if args.prefix:
        if os.path.exists(args.prefix):
            tmp_dir = args.prefix
        else:
            raise ValueError('The specified --prefix was not found')
    else:
        tmp_dir = []

    if args.access:
        access_opt = args.access.split(',')
    else:
        access_opt = ''

    set_logger_level(args.v or user_input)

    main(user_input, containers_search, devices_search,
         reservs, exclude_feats, access_opt, tmp_dir)
96 changes: 96 additions & 0 deletions config/reframe_config_template.j2
@@ -0,0 +1,96 @@
# Copyright 2024 Swiss National Supercomputing Centre (CSCS/ETH Zurich)
# ReFrame Project Developers. See the top-level LICENSE file for details.
#
# SPDX-License-Identifier: BSD-3-Clause

# This is a generated ReFrame configuration file
# The values in this file are dynamically filled in using the system's current configuration

site_configuration = {
    'systems': [
        {
            'name': '{{ name }}', # Name of the system
            'descr': 'System description for {{ name }}', # Description of the system
            'hostnames': {{hostnames}}, # Hostname used by this system
            'modules_system': '{{modules_system}}',
            {% if modules %}
            # Specify the modules to be loaded in the system when running reframe (if any)
            # https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.modules
            'modules': {{ modules }},
            {% endif %}
            {% if resourcesdir %}
            # https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.resourcesdir
            'resourcesdir': '{{ resourcesdir }}', # Directory path for system resources
            {% endif %}
            # Define the partitions of the system (based on node type or reservations)
            # !!Partition autodetection is only available for the slurm scheduler
            'partitions': [
                {% for partition in partitions %}
                {
                    'name': '{{partition.name}}',
                    'descr': '{{partition.descr}}',
                    'launcher': '{{partition.launcher}}', # Launcher for parallel jobs
                    'environs': {{partition.environs}}, # Check 'environments' config below
                    'scheduler': '{{partition.scheduler}}',
                    'time_limit': '{{partition.time_limit}}',
                    'max_jobs': {{partition.max_jobs}},
                    {% if partition.features | length > 1 %}
                    # Resources for testing this partition (https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.resources)
                    'resources': [{'name': 'switches',
                                   'options': ['--switches={num_switches}']},
                                  {'name': 'gres',
                                   'options': ['--gres={gres}']},
                                  {'name': 'memory',
                                   'options': ['--mem={mem_per_node}']}],
                    # https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.extras
                    'extras': {{partition.extras}},
                    # https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.env_vars
                    'env_vars': {{partition.env_vars}},
                    {% if partition.devices %}
                    # Check if any specific devices were found in this node type
                    # The gpus found in slurm GRes will be specified here
                    'devices': [
                        {% for dev in partition.devices %}
                        { 'type': '{{dev.type}}',
                          'model': '{{dev.model}}',
                          {% if dev.arch %}
                          'arch': '{{dev.arch}}',
                          {% endif %}
                          'num_devices': {{dev.num_devices}}
                        },
                        {% endfor %}
                    ],
                    {% endif %}
                    {% if partition.container_platforms %}
                    # Check if any container platforms are available in these nodes and add them
                    # https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#container-platform-configuration
                    'container_platforms': [
                        {% for c_p in partition.container_platforms %}
                        { 'type': '{{c_p.type}}', # Type of container platform
                          {% if c_p.modules %}
                          # Specify here the modules required to run the container platforms (if any)
                          'modules': {{c_p.modules}}
                          {% endif %}
                        },
                        {% endfor %}
                    ],
                    {% endif %}
                    {% endif %}
                    {% if partition.access %}
                    # Options passed to the job scheduler in order to submit a job to the specific nodes in this partition
                    'access': {{partition.access}},
                    {% endif %}
                    {% if partition.features %}
                    # Node features detected in slurm
                    'features': {{partition.features}},
                    {% endif %}
                },
                {% endfor %}
            ],
        },
    ],
    # The environments cannot be automatically detected, check the following links for reference
    # 'https://github.com/eth-cscs/cscs-reframe-tests/tree/alps/config/systems': CSCS github repo
    # 'https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#environment-configuration': ReFrame documentation
    'environments': []
}