Skip to content

knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.

License

Notifications You must be signed in to change notification settings

NVIDIA/knavigator

Repository files navigation

Logo

Knavigator

Build Status Static Badge

Overview

Knavigator is a project designed to analyze, optimize, and compare scheduling systems, with a focus on AI/ML workloads. It addresses various needs, including testing, troubleshooting, benchmarking, chaos engineering, performance analysis, and optimization.

The term "knavigator" is derived from "navigator," with a silent "k" prefix representing "kubernetes." Much like a navigator, this initiative assists in charting a secure route and steering clear of obstacles within the cluster.

Knavigator interfaces with Kubernetes clusters to manage tasks such as manipulating with Kubernetes objects, evaluating PromQL queries, as well as executing specific operations.

Knavigator can operate both outside and inside a Kubernetes cluster, leveraging the Kubernetes API for task management.

To facilitate large-scale experiments without the overhead of running actual user workloads, Knavigator utilizes KWOK for creating virtual nodes in extensive clusters.

Architecture

knavigator architecture

Components

  • K8S control plane: a set of components that manage the state and configuration of a vanilla Kubernetes cluster.
  • Scheduling Framework: cloud-native job scheduling system for batch, HPC, AI/ML, and similar applications in a Kubernetes cluster.
  • KWOK: Allows for the rapid setup of simulated Kubernetes clusters with minimal resource usage.
  • Knavigator: Facilitates communication with the Kubernetes cluster via the Kubernetes API, enabling task management and data retrieval.
  • Metrics & Dashboard: Gathers and processes metrics from the cluster, focusing on scheduling performance and resource utilization.

Workflow

Knavigator offers versatile configuration options, allowing it to function independently, serve as an HTTP/gRPC server, or seamlessly integrate as a package or library within other systems.

In its standalone mode, Knavigator can be set up using a descriptive YAML file, where users specify the sequence of tasks to be executed. This mode is ideal for isolated testing scenarios where Knavigator operates independently.

Alternatively, in server or package configurations, Knavigator can receive a series of API calls to define the tasks to be performed. This mode facilitates integration with existing systems or frameworks, providing flexibility in how tasks are defined and managed.

Regardless of the configuration mode, Knavigator executes tasks sequentially. Each task is dependent on the successful completion of the preceding one. Therefore, if any task fails during execution, the entire test is marked as failed. This ensures comprehensive testing and accurate reporting of results, maintaining the integrity of the testing process.

Documentation

About

knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.

Resources

License

Stars

Watchers

Forks

Packages