-
Notifications
You must be signed in to change notification settings - Fork 0
/
todo.txt
49 lines (45 loc) · 1.89 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Set ephemeral runner to be killed if no job is picked up
Ensure that processes launched by slurm are killed at end of job
Before June 1
-------------
- Plan for replacement of head/login nodes with omega1
- backup of /home, /apps, and /cluster to omega1
- Set up automatic backups
- Use zfs raidz with two drives for /cluster
- Add harry,caffeine to cluster
- APC SmartUPS 2200 - battery charge flashing
Important
---------
- Default route wrong on login nodes
Later
-----
- Use autofs for home and apps mount to increase resiliency
- Restart jenkins runners
- Add bmcinterface to addresses table
- Move troy, orbitty, adrenochrome into cluster
- login node got wrong default route, need better method
- Need to set MaxNodeCount and SelectType=select/cons_res on slurmctld
- iPXE config: https://forum.ipxe.org/showthread.php?tid=7231&pid=10281
- Need to select node image based on mac address
- Script to clear out old node images
- Mail sending from headnode (SPF record is correct)
- Nagios monitoring
- Monitor: ping, ssh, uid resolving
- Disable/Enable nodes based on nagios checks
- Store node rsyslogs into other location
- cluster-aware fail2ban
- inventory script (web and command-line)
- Modules: set default for rocm/hip, cuda, gcc, and openmpi
- ZFS homedir quotas
- sudo support (with password?)
- Need DNS alias support (e.g. for homedir -> test1.cluster)
- Management: Create a class for "nodes" which has structure for querying cluster config
- Management: Use ansible to configure node image
- external docker context to get ansible config
- Store node rsyslogs into other location
- Nodes: custom commands (locate, module_variants)
- runall -r should ignore command-line arguments after the command-name
- reduce CPU use of containerd
- methane doesn't support ipmi lanplus?
- shutdown: Failed to set wall message, ignoring: Could not activate remote peer.
- docker registry with write restriction