Gpu config api #684

Open
wants to merge 19 commits into
base: main
19 commits
7f765d7
First pass to allow gpu-type parameter
at88mph Aug 2, 2024
471603e
Merge branch 'fixes' into gpu-config-api
at88mph Aug 2, 2024
414fca7
Merge branch 'fixes-init-users' into gpu-config-api
at88mph Aug 2, 2024
d20ec2b
Merge branch 'fixes-launch-config' into gpu-config-api
at88mph Aug 2, 2024
7dd1918
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Aug 6, 2024
d94ee0f
Cleanup.
at88mph Aug 6, 2024
3173446
Chart update. Depends on ephem-storage-config branch being merged.
at88mph Aug 6, 2024
558751a
Add command to get current CUDA driver version.
at88mph Aug 9, 2024
b46455f
Consolidate helper templates and add GPU Version to environment.
at88mph Aug 9, 2024
ecde569
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Aug 12, 2024
82d4f5f
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Aug 15, 2024
5ac994c
Cleanup
at88mph Aug 15, 2024
3508fce
Typo fix.
at88mph Aug 15, 2024
f20f92a
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Aug 16, 2024
0adb12b
Image version update.
at88mph Aug 16, 2024
40ea9e2
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Sep 3, 2024
fd46606
Merge branch 'gpu-config-api' of https://github.com/at88mph/science-p…
at88mph Sep 3, 2024
ec56b61
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Sep 9, 2024
f7e56cb
Merge branch 'main' of https://github.com/opencadc/science-platform i…
at88mph Sep 27, 2024
4 changes: 2 additions & 2 deletions deployment/helm/skaha/Chart.yaml
@@ -15,13 +15,13 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.7.0
version: 0.8.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "0.21.0"
appVersion: "0.22.0"

dependencies:
- name: "redis"
5 changes: 3 additions & 2 deletions deployment/helm/skaha/skaha-config/k8s-resources.properties
@@ -32,5 +32,6 @@ cores-options = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Other options for memory (RAM) in GB
mem-gb-options = 1 2 4 6 8 10 12 14 16 20 24 26 28 30 32 36 40 44 48 56 64 80 92 112 128 140 170 192

# GPU options
gpus-options = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
# GPU options
gpu-count:nvidia = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
gpu-count:amd = 1
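
The `gpu-count:<type>` keys above pair a GPU type with a space-delimited list of allowed counts. A minimal sketch of reading such lines into a type-to-counts map follows. The class name `GpuCountOptions` and its `parse` method are hypothetical and are not part of this PR, which reads the file through the project's `MultiValuedProperties` instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the skaha code base.
final class GpuCountOptions {
    private static final String KEY_PREFIX = "gpu-count:";

    /** Parse lines such as "gpu-count:nvidia = 1 2 4" into a map of GPU type to allowed counts. */
    static Map<String, List<Integer>> parse(final List<String> lines) {
        final Map<String, List<Integer>> byType = new LinkedHashMap<>();
        for (final String rawLine : lines) {
            final String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            final int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key = value line
            }
            final String key = line.substring(0, eq).trim();
            if (!key.startsWith(KEY_PREFIX)) {
                continue; // unrelated setting, e.g. cores-options
            }
            final String type = key.substring(KEY_PREFIX.length()).trim().toLowerCase(Locale.ROOT);
            final String value = line.substring(eq + 1).trim();
            if (type.isEmpty() || value.isEmpty()) {
                continue; // ignore entries with no type or no counts
            }
            final List<Integer> counts = new ArrayList<>();
            for (final String token : value.split("\\s+")) {
                counts.add(Integer.valueOf(token));
            }
            byType.put(type, Collections.unmodifiableList(counts));
        }
        return Collections.unmodifiableMap(byType);
    }

    public static void main(final String[] args) {
        final Map<String, List<Integer>> options = parse(Arrays.asList(
                "# GPU options",
                "gpu-count:nvidia = 1 2 4",
                "gpu-count:amd = 1"));
        System.out.println(options);
    }
}
```

The sketch splits on the first `=` by hand because `java.util.Properties` treats an unescaped `:` as a key terminator, which would cut these keys at `gpu-count`. How `MultiValuedProperties` handles the colon is not shown in this diff.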
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-carta.yaml
@@ -39,14 +39,11 @@ spec:
containers:
- name: "${skaha.jobname}"
env:
{{ template "skaha.job.environment.common" . }}
- name: skaha_hostname
value: "${skaha.hostname}"
- name: skaha_username
value: "${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: PWD
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: OMP_NUM_THREADS
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-contributed.yaml
@@ -39,14 +39,11 @@ spec:
containers:
- name: "${skaha.jobname}"
env:
{{ template "skaha.job.environment.common" . }}
- name: skaha_hostname
value: "${skaha.hostname}"
- name: skaha_username
value: "${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: PWD
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: JULIA_NUM_THREADS
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-desktop-app.yaml
@@ -43,10 +43,7 @@ spec:
- ${skaha.userid}
- ${software.containerparam}
env:
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
{{ template "skaha.job.environment.common" . }}
- name: DISPLAY
value: "${software.targetip}"
- name: GDK_SYNCHRONIZE
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-desktop.yaml
@@ -42,16 +42,13 @@ spec:
args:
- ${skaha.hostname}
env:
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
{{ template "skaha.job.environment.common" . }}
- name: VNC_PW
value: "${skaha.sessionid}"
- name: skaha_hostname
value: "${skaha.hostname}"
- name: skaha_username
value: "${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: MOZ_FORCE_DISABLE_E10S
value: "1"
- name: SKAHA_API_VERSION
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-headless.yaml
@@ -41,14 +41,11 @@ spec:
- name: "${skaha.jobname}"
# image and start of the 'env' label comes from the image bundle
${headless.image.bundle}
{{ template "skaha.job.environment.common" . }}
- name: skaha_hostname
value: "${skaha.hostname}"
- name: skaha_username
value: "${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: PWD
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: OMP_NUM_THREADS
5 changes: 1 addition & 4 deletions deployment/helm/skaha/skaha-config/launch-notebook.yaml
@@ -39,12 +39,11 @@ spec:
containers:
- name: "${skaha.jobname}"
env:
{{ template "skaha.job.environment.common" . }}
- name: skaha_hostname
value: "${skaha.hostname}"
- name: skaha_username
value: "${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: JUPYTER_TOKEN
value: "${skaha.sessionid}"
- name: JUPYTER_CONFIG_DIR
@@ -63,8 +62,6 @@ spec:
value: "${skaha.userid}"
- name: NB_UID
value: "${skaha.posixid}"
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: PWD
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: XDG_CACHE_HOME
14 changes: 13 additions & 1 deletion deployment/helm/skaha/templates/_helpers.tpl
@@ -128,7 +128,19 @@ The affinity for Jobs.
{{- end }}

{{/*
Common security context settings for User Session Jobs
Common environment variables for User Sessions.
*/}}
{{- define "skaha.job.environment.common" -}}
- name: HOME
value: "${SKAHA_TLD}/home/${skaha.userid}"
- name: skaha_sessionid
value: "${skaha.sessionid}"
- name: "NVIDIA_CUDA_MAJOR_VERSION"
value: "${software.gpu.cuda.majorVersion}"
{{- end }}

{{/*
Common security settings for User Sessions.
*/}}
{{- define "skaha.job.securityContext" -}}
runAsUser: ${skaha.posixid}
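
The `skaha.job.environment.common` helper in this file emits values such as `${skaha.userid}` and `${software.gpu.cuda.majorVersion}` verbatim, so they have to be filled in after Helm has rendered the launch files. How skaha resolves them is not shown in this diff. The following is only a rough sketch of one way `${name}` placeholders could be substituted, with a hypothetical class name and an assumed example value of `12`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${name} substitution; not the skaha implementation.
final class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replace each ${name} with its value from the map; unknown names are left untouched. */
    static String substitute(final String template, final Map<String, String> values) {
        final Matcher matcher = PLACEHOLDER.matcher(template);
        final StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            final String name = matcher.group(1);
            final String replacement = values.getOrDefault(name, matcher.group());
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(final String[] args) {
        final Map<String, String> values = new HashMap<>();
        values.put("software.gpu.cuda.majorVersion", "12"); // assumed example value
        System.out.println(substitute("value: \"${software.gpu.cuda.majorVersion}\"", values));
    }
}
```

Leaving unknown placeholders in place, rather than replacing them with an empty string, makes a missing value easier to spot in the rendered Job.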
2 changes: 2 additions & 0 deletions deployment/helm/skaha/templates/skaha-config-configmap.yaml
@@ -1,4 +1,6 @@
{{ $currContext := . }}

---
apiVersion: v1
kind: ConfigMap
metadata:
2 changes: 1 addition & 1 deletion deployment/helm/skaha/values.yaml
@@ -17,7 +17,7 @@ skahaWorkload:
deployment:
hostname: myhost.example.com # Change this!
skaha:
image: images.opencadc.org/platform/skaha:0.21.0
image: images.opencadc.org/platform/skaha:0.22.0
imagePullPolicy: Always

# Set the top-level-directory name that gets mounted at the root.
1 change: 1 addition & 0 deletions skaha/.gitignore
@@ -4,3 +4,4 @@
/.settings/
/.tmp/
/.classpth/
/src/intTest/resources/skaha-test*
2 changes: 1 addition & 1 deletion skaha/VERSION
@@ -1,6 +1,6 @@
## deployable containers have a semantic and build tag
# version tag: major.minor.patch
# build version tag: timestamp
VER=0.21.0
VER=0.22.0
TAGS="${VER} ${VER}-$(date -u +"%Y%m%dT%H%M%S")"
unset VER
2 changes: 1 addition & 1 deletion skaha/gradle/wrapper/gradle-wrapper.properties
@@ -1,6 +1,6 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.6.1-bin.zip
distributionUrl=https\://services.gradle.org/distributions/gradle-6.8.3-bin.zip
networkTimeout=10000
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
@@ -68,19 +68,12 @@
package org.opencadc.skaha;

import ca.nrc.cadc.auth.AuthMethod;
import ca.nrc.cadc.net.HttpGet;
import ca.nrc.cadc.reg.Standards;
import ca.nrc.cadc.reg.client.RegistryClient;
import ca.nrc.cadc.util.Log4jInit;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.io.ByteArrayOutputStream;
import java.lang.reflect.Type;
import java.net.URL;
import java.security.PrivilegedExceptionAction;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import javax.security.auth.Subject;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
@@ -86,7 +86,6 @@
* @author majorb
*/
public class SessionLifecycleTest {

public static final String PROD_IMAGE_HOST = "images.canfar.net";
public static final String DEV_IMAGE_HOST = "images-rc.canfar.net";
private static final Logger log = Logger.getLogger(SessionLifecycleTest.class);
5 changes: 3 additions & 2 deletions skaha/src/intTest/java/org/opencadc/skaha/SessionUtil.java
@@ -114,6 +114,7 @@
import java.util.Map;
import java.util.stream.Collectors;


public class SessionUtil {
public static final URI SKAHA_SERVICE_ID = URI.create("ivo://cadc.nrc.ca/skaha");
private static final Logger LOGGER = Logger.getLogger(SessionUtil.class);
@@ -346,7 +347,7 @@ static List<Session> getSessions(final URL sessionURL, String... omitStatuses) t
return active;
}

private static Session getDesktopApplicationSessionWithoutWait(final URL desktopApplicationURL, final String desktopApplicationSessionID,
private static Session getDesktopApplicationSessionWithoutWait(final URL desktopApplicationURL, final String desktopApplicationSessionID,
final String expectedState) {
return SessionUtil.getAllDesktopApplicationSessions(desktopApplicationURL).stream()
.filter(session -> session.getAppId().equals(desktopApplicationSessionID) && session.getStatus().equals(expectedState))
@@ -464,7 +465,7 @@ protected static Image getImageOfType(final String type) throws Exception {

protected static List<Image> getImagesOfType(final String type) throws Exception {
final RegistryClient registryClient = new RegistryClient();
final URL imageServiceURL = registryClient.getServiceURL(SessionUtil.SKAHA_SERVICE_ID,
final URL imageServiceURL = registryClient.getServiceURL(SessionUtil.getServiceID(),
Standards.PROC_SESSIONS_10, AuthMethod.TOKEN);
final URL imageURL = new URL(imageServiceURL.toExternalForm() + "/image");

@@ -64,39 +64,41 @@
*
************************************************************************
*/

package org.opencadc.skaha.context;

import ca.nrc.cadc.util.MultiValuedProperties;
import ca.nrc.cadc.util.PropertiesReader;

import ca.nrc.cadc.util.StringUtil;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;

import java.util.Map;
import org.apache.log4j.Logger;
import org.opencadc.skaha.SkahaAction;

/**
* @author majorb
*
*/
public class ResourceContexts {

private static final Logger log = Logger.getLogger(ResourceContexts.class);

private final Integer defaultRequestCores;
private final Integer defaultLimitCores;
private final Integer defaultCores;
private final Integer defaultCoresHeadless;
private final List<Integer> availableCores = new ArrayList<>();

// units in GB
private final Integer defaultRequestRAM;
private final Integer defaultLimitRAM;
private final Integer defaultRAM;
private final Integer defaultRAMHeadless;
private final List<Integer> availableRAM = new ArrayList<>();

private final List<Integer> availableGPUs = new ArrayList<>();
private final GPUConfigurations gpuConfigurations;

public ResourceContexts() {
try {
@@ -112,17 +114,15 @@ public ResourceContexts() {
defaultRAMHeadless = Integer.valueOf(mvp.getFirstPropertyValue("mem-gb-default-headless"));
String cOptions = mvp.getFirstPropertyValue("cores-options");
String rOptions = mvp.getFirstPropertyValue("mem-gb-options");
String gOptions = mvp.getFirstPropertyValue("gpus-options");


gpuConfigurations = new GPUConfigurations(mvp);

for (String c : cOptions.split(" ")) {
availableCores.add(Integer.valueOf(c));
}
for (String r : rOptions.split(" ")) {
availableRAM.add(Integer.valueOf(r));
}
for (String g : gOptions.split(" ")) {
availableGPUs.add(Integer.valueOf(g));
}
} catch (Exception e) {
log.error(e);
throw new IllegalStateException("failed reading k8s-resources.properties", e);
@@ -166,9 +166,65 @@ public Integer getDefaultRAM(String sessionType) {
public List<Integer> getAvailableRAM() {
return availableRAM;
}

public List<Integer> getAvailableGPUs() {
return availableGPUs;

/**
* Check that the requested GPU count and type are valid.
*
* @param count The number of GPUs to request.
* @param type The GPU type to request, matching a configured <code>gpu-count:[type]</code> key.
* @throws IllegalArgumentException when not valid.
* @see GPUConfigurations#verifyGPUConfiguration(int, String)
*/
public void verifyGPUConfiguration(final int count, final String type) {
log.info("ResourceContexts.verifyGPUConfiguration(" + count + ", " + type + ")");
this.gpuConfigurations.verifyGPUConfiguration(count, type);
}

private static final class GPUConfigurations {
private static final String GPU_CONFIGURATION_KEY = "gpu-count";
private static final String GPU_CONFIGURATION_KEY_TYPE_DELIMITER = ":";
private final Map<String, int[]> configurations;

/**
* Constructor. Build a type-count map from the configuration.
*
* @param multiValuedProperties The MultiValuedProperties created from a configuration file.
* Pertinent keys are expected to be in format <code>gpu-count:[amd | nvidia]</code> with a value of space delimited
* integers.
*/
public GPUConfigurations(final MultiValuedProperties multiValuedProperties) {
final Map<String, int[]> configurationsFromFile = new HashMap<>();
multiValuedProperties.keySet()
.stream()
.filter(key -> key.startsWith(GPUConfigurations.GPU_CONFIGURATION_KEY + GPUConfigurations.GPU_CONFIGURATION_KEY_TYPE_DELIMITER))
.filter(key -> StringUtil.hasText(multiValuedProperties.getFirstPropertyValue(key)))
.forEach(key -> {
final int[] value =
Arrays.stream(multiValuedProperties.getFirstPropertyValue(key).split(" "))
.mapToInt(Integer::parseInt)
.toArray();
final String type = key.split(GPUConfigurations.GPU_CONFIGURATION_KEY_TYPE_DELIMITER)[1].toLowerCase().trim();
configurationsFromFile.put(type, value);
});
this.configurations = Collections.unmodifiableMap(configurationsFromFile);
}

private void verifyGPUConfiguration(final int count, final String type) {
log.info("GPUConfigurations.verifyGPUConfiguration()");
if (!StringUtil.hasText(type) || count < 1) {
throw new IllegalArgumentException("GPU type (gpu-type) and GPU Count (gpus) are required fields.");
}

final String gpuType = type.toLowerCase().trim();
if (!this.configurations.containsKey(gpuType)) {
throw new IllegalArgumentException("Unsupported GPU type '" + type + "'");
}
final int[] counts = this.configurations.get(gpuType);
if (counts == null || counts.length == 0) {
throw new IllegalArgumentException("No configured count values for GPU type '" + type + "'");
} else if (Arrays.stream(counts).noneMatch(allowed -> allowed == count)) {
// The type is configured, but the requested count is not one of its listed values.
throw new IllegalArgumentException("Unsupported GPU count " + count + " for GPU type '" + type + "'");
}
}
}

}
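
A self-contained sketch of this kind of type-and-count check, written without the CADC utility classes so it can run on its own. It assumes a request should be rejected when the count is not among the values configured for the type. The class name `GpuRequestValidator` and its `verify` method are hypothetical and do not come from this PR.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone validator, for illustration only.
final class GpuRequestValidator {
    // Keys are expected to be lower-case GPU type names, e.g. "nvidia" or "amd".
    private final Map<String, int[]> allowedCountsByType;

    GpuRequestValidator(final Map<String, int[]> allowedCountsByType) {
        this.allowedCountsByType = new HashMap<>(allowedCountsByType);
    }

    /** Throws IllegalArgumentException unless the type is configured and the count is listed for it. */
    void verify(final int count, final String type) {
        if (type == null || type.trim().isEmpty() || count < 1) {
            throw new IllegalArgumentException("GPU type and a positive GPU count are required.");
        }
        final String gpuType = type.trim().toLowerCase(Locale.ROOT);
        final int[] counts = allowedCountsByType.get(gpuType);
        if (counts == null) {
            throw new IllegalArgumentException("Unsupported GPU type '" + type + "'");
        }
        if (Arrays.stream(counts).noneMatch(allowed -> allowed == count)) {
            throw new IllegalArgumentException(
                    "Unsupported GPU count " + count + " for GPU type '" + type + "'");
        }
    }

    public static void main(final String[] args) {
        final Map<String, int[]> configured = new HashMap<>();
        configured.put("nvidia", new int[] {1, 2, 4});
        configured.put("amd", new int[] {1});
        final GpuRequestValidator validator = new GpuRequestValidator(configured);

        validator.verify(2, "NVIDIA"); // accepted: type lookup is case-insensitive
        try {
            validator.verify(2, "amd"); // rejected: only a count of 1 is listed for amd
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```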