- Add support for optional payload encryption in the client SDK and CLI as a follow-up to #484 (#493)
- Allow unicode characters in project(user group) name and domain name. (#1663)
- Improve exception logging stability by pre-formatting exception objects instead of pickling/unpickling them (#1759)
- Add new API to create new image from live session (#1973)
- Clear
error_logs
records in theclear-history
command (#1989) - Introduce
mgr schema dump-history
andmgr schema apply-missing-revisions
command to ease the major upgrade involving deviation of database migration histories (#2002) - Update
image forget
CLI command to untag image from registry before forgetting it from the database (#2010) - Update
etcd-client-py
to 0.3.0 (#2014) - Allow self-ssh in single-node single-container compute sessions. (#2032)
- Prevent deleting mounted folders. (#2036)
- Allow agent to report its internal registry snapshot via UNIX domain socket server (#2038)
- New redis client (experimental) (#2041)
- Expose user info to environment variables (#2043)
- Introduce the
rolling_count
GraphQL field to provide the current rate limit counter for a keypair within the designated time window slice (#2050) - Deprecate the reliance on HTTP cookies for authenticating the pipeline service, switching to the use of HTTP headers instead (#2051)
- Allow user to explicitly set filename of model definition YAML (#2063)
- Add the
backend.ai plugin scan
command to inspect the plugin scan results from various entrypoint sources (#2070) - Bring back etcetra-backed Etcd as an option for ditributed lock backend (#2079)
- Enable distribute-lock configuration (#2080)
- Cache volume objects in
RootContext.get_volume
(#2081) - Revamp images GQL query by changing image filtering from flag-based to feature set-based and add
aliases
field to customized image GQL schema (#2136) - Added missing fields for
keypair_resource_policy
in client-py, models, etc. (#2146) - Add parameters to
check-presets
SDK function (#2153) - Add relay-aware
VirtualFolderNode
GQL Query (#2165) - Also perform basic model service validation process when updating model service via
ModifyEndpoint
(#2167) - Add support for mounting arbitrary VFolders on model service session (#2168)
- Add support for CentOS 8 based kernels (#2220)
- Clear zombie routes automatically (#2229)
- Add
scaling_group.agent_count_by_status
andscaling_group.agent_total_resource_slots_by_status
GQL fields to query the count and the resource allocation of agents that belong to a scaling group. (#2254) - Allow modifying model service session's environment variable setup (#2255)
- Add
endpoint.runtime_variant
column (#2256) - Add new API to show list of supported inference runtimes (#2258)
- Add support for model service provisioning without
model-definition.yaml
(#2260) - Allow superadmins to force-update session status through destroy API. (#2275)
- Add session status check & update API. (#2312)
- Add support for fetching container logs of a specific kernel. (#2364)
- Introduce Python native WSProxy (#2372)
- Implement scanning plugin entrypoints of external packages (#2377)
- Add
row_id
,type
andcontainer_registry
fields to theGroupNode
GQL schema. (#2409) - Add support for PureStorage RapidFiles Toolkit v2 (#2419)
- Add API that extends lifespan of webserver's login session. (#2456)
- Allow bulk association and disassociation of scaling groups with domains, user groups, and key pairs. (#2473)
- Match container's timezone to container host OS when available (#2503)
- Add a pre-setup configuration menu to the TUI installer to allow setting the public-facing address of Backend.AI components (#2541)
- Now Backend.AI can run arbitrary container images without Backend.AI-specific metadata labels by introducing good default values and replacing intrinsic kernel-runner binaries with statically built ones (#2582)
- Allow
Bearer
as valid token type on model service authentication (#2583) - Introduce automatic creation of a 'model-store' group upon inserting a new domain. (#2611)
- Add support for declaring custom description field for GraphQL
relay
edge types. (#2643) - Add an
enable_LLM_playground
option to show/hide the LLM playground tab on the serving page. (#2677) - Add
max_gaudi2_devices_per_container
config on webserver (#2685) - Add
max_atom_plus_device_per_container
config on webserver (#2686) - Introduce Account-manager component. (#2688)
-
- Add query depth limit config of GQL.
- Add page size limit config of GQL Connection.
- Set default page size of GQL Connection to 10. (#2709)
- Add compute session GQL Relay query schema. (#2711)
- Allow
DataLoaderManager
to get a loader function by function itself rather than function name. (#2717) - Allow filter and order in endpointlist gql request. (#2723)
- Add new vfolder API to update sharing status. (#2740)
- Avoid raising a type error even if a particular table in the toml file is empty, as long as the default value for all settings exists. (#2782)
- Add an explicit configuration
scaling-group-type
toagent.toml
so that the agent could distinguish whether itself belongs to an SFTP resource group or not (#2796) - Add per-session priority attributes and
ModifyComputeSession
GraphQL mutation to update session names and priorities (#2840) - Add dependee/dependent/graph ComputeSessionNode connection queries (#2844)
- Implement the priority-aware scheduler that applies to any arbitrary scheduler plugin (#2848)
- Add support for setting a timeout when pulling Docker images and upgrade aiodocker to version 0.23.0. (#2852)
- Enable robust DB connection handling by allowing
pool-pre-ping
setting. (#1991) - Enhance update mechanism of session & kernel status. (#2311)
- Remove database-level foreign key constraints in
vfolders.{user,group}
columns to decouple the timing of vfolder deletion and user/group deletion. (#2404) - Implement storage-host RBAC interface. (#2505)
- Optimize the query latency when fetching a large number of agents with stat metrics from Redis (#2558)
- Split out
ai.backend.logging
package from theai.backend.common
to improve reusability and reduce the startup time (i.e., import latencies) (#2760) - Avoid using
collections.OrderedDict
when not necessary in the manager API and client SDK (#2842)
- Remove no longer used
env-tester-{admin,user,user2}.sh
scripts and all references (#1956)
- Merge
kernels.role
intosessions.session_type
and check the image compatibility based on comparison with theai.backend.role
label (#1587) - Refactor
PendingSession
Scheduler intoPendingSession
scheduler andAgentSelector
, and replaceroundrobin
flag withAgentSelectionStrategy.RoundRobin
policy. (#1655) - Do not omit to update session's occupying resources to DB when a kernel starts. (#1832)
- Fix DDN command output handling when exceeding quotas. (#1901)
- Explicitly specify the storage-side UID/GID when creating qtrees in the NetApp storage backend (#1983)
- Sync mismatch between
kernels.session_name
andsessions.name
and fix session-rename API to updatesession_name
of sibling kernels atomically. (#1985) - Change function default arguments from mutable object to
None
. (#1986) - Revert some VFolder APIs response type to remove mismatch between
Content-Type
header and body. (#1988) - Upgrade pants to 2.21.0.dev4 for Python 3.12 support in their embedded pex/pip versions (#1998)
- Fix Graylog log adapter not working after upgrading to Python 3.12 (#1999)
- Fix
compute_container
GraphQL query resolver functions. (#2012) - Fix harbor v2 image scanner skipping importing rest of the artifacts when any of the item does not include tag (#2015)
- Let external log viewers display more accurate, meaningful stack frames of logger invocations. (#2019)
- Fix handling of undefined values in the ModifyImage GraphQL mutation. (#2028)
- Fix container commit not working on certain docker engine versions (#2040)
- add omitted request fetching from client to manager about deleting vfolder in trash bin. (#2042)
- Fix a buggy restriction on VFolder deletion due to wrong query condition (#2055)
- Fix wrong usage of dataloader in GQL group resolver. (#2056)
- Ensure that vfolders, including automount vfolders, are mounted during session creation only if their status is not set to "DEAD" (i.e., deleted). (#2059)
- Fix wrong calculation of resource usage (#2062)
- Fix VFolder file operation not working when user has been granted access to shared but deleted VFolder which has same name with the normal one (#2072)
- Add missing type argument in group query (#2073)
- Let the
backend.ai mgr clear-history
command clears session records as well as kernel records (#2077) - Fix
compute_session_list
GQL query not responding on an abundant amount of sessions (#2084) - Fix VFolder invitation not accepted when inviting VFolder shares name with already deleted one (#2093)
- Fix orphan model service routes being created (#2096)
- Fix initialization of the resource usage API's kernel-level usage aggregation (#2102)
- Fix model server starting on every kernels (including sub role kernels) on multi container infernce session (#2124)
- Add missing
commit_session_to_file
toOP_EXC
(#2127) - Fix wrong SQL query build for GQL Relay node (#2128)
- Pass ImageRef.canonical in
commit_session_to_file
(#2134) - Handle fileset-already-exists response of
create-filset
API request and make sure to wait between all GPFS job polling iterations (#2144) - Skip any possible redundant quota update requests when creating new quota (#2145)
-
- Fix error when calling
check_presets
Client SDK API with an invalidgroup
parameter - Rewrite Client SDK to access all APIConfig fields (#2152)
- Fix error when calling
- Ensure that all pending sessions are picked by schedulers (#2155)
- Fix user creation error when any model-store does not exists. (#2160)
- Fix buggy resolver of
model_card
GQL Query. (#2161) - Fix security vulnerability for
sudo_session_enabled
(#2162) - Rename
endpoints.model_mount_destiation
tomodel_mount_destination
(#2163) - Wait for real quota scope directory creation after Netapp
create_qtree()
call (#2170) - Fix wrong per-user concurrency calculation logic (#2175)
- Keep
sync_container_lifecycles()
bgtask alive in a loop. (#2178) - Fix missing check for group (project) vfolder count limit and error handling with an invalid
group
parameter (#2190) - Fix model service persisting on
degraded
status forever in rare chance when trying to delete the service (#2191) - Fix error when query or mutate GraphQL using
BigInt
field type (#2203) - Ensure that utilization idleness is checked after a set period. (#2205)
- Fix
backend.ai ssh
command execution when packaged as SCIE/PEX (#2226) -
- fix
endpoints
query not working when trying to loadimage_row.aliases
- fix
endpoints.status
reportingPROVISIONING
when its status is inDESTROYING
state (#2233)
- fix
- Fix GQL raising error when trying to resolve
endpoints.errors
field occasionally (#2236) - Fix
ZeroDivisionError
in volume usage calculation by returning 0% when volume capacity is zero (#2245) - Fix GraphQL to support query to non-installed images (#2250)
- Add missing
push_image
method implementation to Dummy Agent (#2253) - Rename no-op
access_key
parameter ofendpoint_list
GQL Query touser_uuid
(#2287) - Fix
ai.backend.service-ports
label syntax broken when image does not expose built-in service port (#2288) - Improve stability of
untag_image_from_registry
mutation (#2289) - SSH not working between kernels started with customized image (#2290)
- Invalid container memory capacity reported (#2291)
- Corrected an issue where the
resource_policy
field in the user model was incorrectly mapped todomain_name
. (#2314) - Omit to clean containerless kernels which are still creating its container. (#2317)
- Fix model service sessions created before 24.03.5 failing to spawn (#2318)
- Image commit not working (#2319)
- model service session scheduler (
scale_services()
) failing when sessions bound to active route already marked as terminated (#2320) - Fix container metric collection halted on systems with Cgroups v1 (#2321)
- Run batch execution after the batch session starts. (#2327)
- Add support for configuring
sync_container_lifecycles()
task. (#2338) - Fix mismatches between responses of
/services/_runtimes
and new model service creation input (#2371) - Fix incorrect check of values returned from docker stat API. (#2389)
- Shutdown agent properly by removing a code that waits a cancelled task. (#2392)
- Restrict GraphQL query to
user_nodes
field to requiresuperadmin
privilege (#2401) - Handle all possible exceptions when scheduling single node session so that the status information of pending session is not empty. (#2411)
- Utilize
ExtendedJSONEncoder
for error logging to handleUUID
objects inextra_data
(#2415) - Change outdated references in event module from
kernels
tosessions
. (#2421) - Upgrade
inquirer
to remove dependency on deprecateddistutils
, which breaks up execution of the scie builds (#2424) - Allow specific status of vfolders to query to purge. (#2429)
- Update the install-dev scripts to use
pnpm
instead ofnpm
to speed up installation and resolve some peculiar version resolution issues related to esbuild. (#2436) - Fix a packaging issue in the
backendai-webserver
scie executable due to missing explicit requirement of setuptools (#2454) - Improve pruning of non-physical filesystems when measuring disk usage in agents (#2460)
- Update the install-dev scripts to install
pnpm
if pnpm isn't installed. (#2472) - Improve error handling of initialization failures in the kernel runner (#2478)
- Fix
BACKEND_MODEL_NAME
environment always overwritten to model name specified at model definition (#2481) - Do not allow assigning preopen port which collides with image's own service port definition (#2482)
- Fix GET requests with queryparams defined in API spec occasionally throwing 400 Bad Request error (#2483)
- Check null value of user mutation by
Undefined
sentinel value rather thanNone
. (#2506) - Do null check on
groups.total_resource_slots
anddomains.total_resource_slots
value. (#2509) - Fix hearbeat processing failing when agent reports image with its name not compilant to Backend.AI's naming rule (#2516)
- Corrected a typo (
maanger
corrected tomanager
) in thecheck_status()
API response of the storage component (#2523) - Rename
images.image_filters
GQL Query argument toimages.image_types
(#2555) - Prevent session status from being transit to
PULLING
status event if image pull is not required (#2556) - Prevent other user's customized image from being listed as a response of
images
GQL query (#2557) - skip resolving malformed
ModelCard
GQL item (#2570) - Delete sessions DB records when purging project. (#2573)
- Initialize Redis connection pool objects with specified connection opts rather than ignoring them. (#2574)
- Fix
GET /func/folders/{folderName}
API returning string literal"null"
instead of null value onuser
andgroup
fields (#2584) - Update
GQLPrivilegeCheckMiddleware
to align with upstream changes ongraphql-core
package (#2598) - Robust type check when idle checker fetches utilization data. (#2601)
- Skip mounting zero-byte lxcfs files when lxcfs is activated to prevent crashes in session containers (#2604)
- Fix typo in minilang query field spec and column map. (#2605)
- Remove duplicate CPU quota arguments when creating containers (#2608)
- Increase
MAX_CMD_LEN
of dropbear to improve compatibility with PyCharm debugger (#2613) - Silence falsy Redis timeout warnings when retrying blocking commands if the timeout does not exceed the expected command timeout (#2632)
- Fix a regression of #2483 in the session-download API used by the
backend.ai ssh
command (#2635) - Implement missing
StrEnumType
handling inpopulate_fixture()
. (#2648) - Let
GET /resource/usage/period
request contain data in query parameter rather than JSON body. (#2661) - Allow sudo-enabled container users to ovewrite
/usr/bin/scp
and/usr/libexec/sftp-server
by unifying the intrinsic ssh binaries to use the mergeddropbearmulti
executable. (#2667) - Update
webserver
logout API to respond with HTTP 200 OK (#2681) - Fix WSProxy not properly handling WebSocket request sent from Firefox (#2684)
- Scan parent directory of created qtree to avoid creating quota on non-existing directory. (#2696)
- Fix
list_files
,get_fstab_contents
,get_performance_metric
andshared_vfolder_info
Python SDK function not working withValidationError
exception printed (#2706) - Resolve the issue where the vfolder id does not match in
list_shared_vfolders
. (#2731) - Handle OS Error when deleting vfolders. (#2741)
- Fix typo in Virtual-folder status update code. (#2742)
- Correct
msgpack
deserialization ofResourceSlot
. (#2754) - Fix regression error of
session create_from_template
command. (#2761) - Silence
model_
namespace warnings with pydantic-based model classes (#2765) - Change the initialization order of PackageContext to apply
target_path
correctly in the TUI installer (#2768) - Make the regex patterns to update configuration files working with multiline texts correctly in the TUI installer (#2771)
- Omit null parameter when call
usage-per-period
API. (#2777) - Delete vfolder invitation and permission rows when deleting vfolders. (#2780)
- Handle container port mismatch when creating kernel. (#2786)
- Explicitly set the protected service ports depending on the resource group type and the service types (#2797)
- Correct session status determiner function. (#2803)
- Fix
endpoint_list.total_count
GQL field returning incorrect value (#2805) - Fix
Service.create()
SDK method andservice create
CLI command not working withUnboundLocalError
exception (#2806) - Refresh expiration time of login session when login. (#2816)
- Fix
kernel_id
assignment for main kernel log retrieval (#2820) - Use a safer TLS version (v1.2) when creating SSL sockets in the logstash handler (#2827)
- Wrong count of concurrent compute sessions. (#2829)
- Create kernels with correct
scaling_group
value. (#2837) - Fix a regression in progress bar rendering of the TUI installer after upgrading the Textual library (#2867)
- Add note about installing client library with same version as server (#1976)
- Remove deprecated
version
from the docker compose YAML templates in package installation docs. (#2035) - Fix a typo in the
agent.toml
example of the package-based installation guide to have a duplicate double quote (#2069)
- Upgrade the base runtime (CPython) version from 3.11.6 to 3.12.2 (#1994)
- Upgrade aiodocker to v0.22.0 with minor bug fixes found by improved type annotations (#2339)
- Update the halfstack containers to point the latest stable versions (#2367)
- Upgrade aiodocker to 0.22.1 to fix error handling when trying to extract the log of non-existing containers (#2402)
- Upgrade the base CPython from 3.12.2 to 3.12.4 (#2449)
- Upgrade Python (3.12.4 -> 3.12.6) and common/tool dependencies to prepare for Python 3.13 and apply latest fixes (#2851)
- Wrap RPC authentication error to custom error for better logging. (#1970)
- Add
requested_slots
field to compute session GQL type. (#1984) - Allow
pydantic.BaseModel
as the API handler return schema. (#1987) - Fix incorrect version notation of GQL Field. (#1993)
- Add max_pending_session_count field to Keypair resource policy GQL schema (#2013)
- Handle container creation exception and start exception in separate try-except contexts. (#2316)
- Fix broken the workflow call for the action that auto-assigns PR numbers to news fragments (#2358)
- Finally stabilize the hanging tests in our CI due to docker-internal races on TCP port mappings to concurrently spawned fixture containers by introducing monotonically increasing TCP port numbers (#2379)
- Further improve the monotonic port allocation logic for the test containers to remove maximum concurrency restrictions (#2396)
- Add PEX, SCIE binary build configs for the plugin subsystem. (#2422)
-
- Add POST
/folders
API endpoints to replace DELETE APIs that require request body. - Allow
DELETE
requests to have body data. (#2571)
- Add POST
- Enhacne type hints for potential
None
arguments (#2580) - Add
ai.backend.manager.models.graphql
module for better code base management. (#2669) - Remove Scheduler related types that are no longer used. (#2705)
- Allow adding required GQL field argument to schema. (#2712)
- Upgrade
readthedocs
build environment to Python 3.12 (#2814)## 24.03.0rc1 (2024-03-31)
- Allw filter
compute_session
query byuser_id
. (#1805) - Allow overriding vfolder mount permissions in API calls and CLI commands to create new sessions, with addition of a generic parser of comma-separated "key=value" list for CLI args and API params (#1838)
- Always enable
ai.backend.accelerator.cuda_open
in the scie-based installer (#1966) - Use
config["pipeline"]["endpoint"]
as default value ofconfig["pipeline"]["frontend-endpoint"]
when not provided (#1972)
- Set single agent per kernel resource usage. (#1725)
- Abort container creation when duplicate container port definition exists (#1750)
- To update image metadata, check if the min/max values in
resource_limits
are undefined. (#1941) - Explicitly disable the user-site package detection in the krunner python commands to avoid potential conflicts with user-installed packages in
.local
directories (#1962) - Fix
caf54fcc17ab
migration to drop a primary key only if it exists and in589c764a18f1
migration, add missing table arguments. (#1963)
- Update docstrings in
ai.backend.client.request.Request:fetch()
andai.backend.client.request.FetchContextManager
as the support for synchronous context manager has been deprecated. (#1801) - Resize font-size of footer text in ethical ads in documentation hosted by read-the-docs (#1965)
- Only resize font-size of footer text in ethical ads not in title of content in documentation (#1967)
- Revert response type of service create API. (#1979)
- Add a policy to predicate to limit the number and resources of concurrent pending sessions. (#1226)
- Implement
/services/_/try
API (#1754) - Add support for multi directory mkdir by fixing cli to accept multiple arguments and by adding list type annotation to accept multiple directories (#1803)
- Add GraphQL mutations to allow altering model service (endpoint) specs, especially for resource allocation and image enviroment setups (#1859)
- Add a new client-side admin command group to manager the quota scopes (
admin quota-scope
), replacing the legacy--quota
options in the vfolder management commands (#1862) - Implement customization of name displayed in the Graylog source field through the config file. (#1866)
- Add
allow_non_auth_tcp
as allowed webui config key (#1868) - Introduce
etcd-client-py
(#1870) - Enhance logging by adding more detailed exceptions when scheduling sessions, such as conditions like missing kernels or no agents available at all for the selected pending session (#1887)
- Add validate-image-canonical and validate-image-alias cli command (#1891)
- Re-define vfolder delete status by adding
delete-pending
,delete-error
vfolder status and setdelete-pending
as trash-bin status anddelete-complete
as hard-delete status. (#1892) - Add new test cases for vfolder CLI commands. (#1918)
- Add the
--non-interactive
flag to the TUI installer (#1922) - Bump the manager API version to v8.20240315 with some big changes memo'ed in manager/server.py (#1938)
- Add new
user_resource_policies.max_session_count_per_model_session
column to limit number of maximum available sessions per each model service created by user (#1948) - Add the --headless flag that run install as headless mode, skips terminal I/O (#1958)
- Remove the image importer API no longer used and unused since the release of Forklift (#1896)
- Write graphene field's deprecation/description message string directly instead of using message generation functions. (#1734)
- Let
mgr agent ping
command to use the agent'sping()
RPC API instead of unimplementedget_node_hwinfo()
, and fix the mis-interpretation of the compute plugin's metadata reports in the agent'sgather_hwinfo()
. (#1793) - When creating a new user via the client admin CLI, always set the default option of the group to "default" to prevent mistakes of creating users without groups. (#1860)
- Fix malfunctioning CLI command
session create-from-template
by reorganizingclick.option
decorators (#1890) - Fix GQL Relay node resolver to parse
filter
andorder
argument into SQL query properly. (#1916) - Allow passing HTTP status codes via the pydantic-based API response model objects (#1927)
- Fix the potential missing resource slots when checking the remaining resources lots in the job scheduler (#1928)
- Fix inability to download beyond 500 MB via SFTP by preventing dropbear from decreasing the trasnfer window size indefinitely, which happens with non-retrying psftp-based SFTP client implementations (#1930)
- Fix CLI
agent info
related issues by replacingHardwareMetadata
todict
when class check and adding parameter to default metric value formatter. (#1934) - Change
endpoints.model
andendpoint_tokens.endpoint
to nullable and setondelete="SET NULL"
. (#1935) - Use
buildDate
instead ofbuild
to retrieve web static version to follow lablup/backend.ai-webui#2072 (#1950) - Fix graylog log backend not working when
localname
config is set (#1951) - Fix
endpoint.routings
GQL field showing routing ID instead of status enum (#1952)
- Update Backend.AI Installation & error guide for Pants version 2.18 and later (#1904)
- Replace
passlib[bcrypt]
tobcrypt
which is better maintained (#1932) - Upgrade pyzmq and callosum version to improve malformed packet handling in manager-to-agent RPC channels (#1939)
- Drop the support for nvidia-docker v1 from the open source CUDA plugin (#1755)
- Add a new log handler corresponding to graylog (#1138)
- Pass
manager.api.RootContext
to plugins for easy access to any Manager's context. (#1699) - Implement async compatible graphql relay node object and implement group/user graphql relay nodes. (#1719)
- Use
ui.menu_blocklist
to hide pipeline menu button and deletepipeline.hide-side-menu-button
. (#1727) - Use
ui.menu_blocklist
to hide andui.menu_inactivelist
to disable menu items. (#1733) - Add a
edu_appname_prefix
config on webserver to easily parse image name from app name. (#1735) - GraphQL API log Graphql errors. (#1737)
- Implement model data card query support with metadata parser (#1749)
- In order to be able to use not only alt_name but also field_ref when using the --format option of session list, add values to FieldSet. (#1756)
- Implement the concept of the "main" keypair to make it clear which keypair to use by default and which one holds the user-level resource limits (#1761)
- Add the "update" mode for fixtures (specified as the
__mode
key in fixture JSON files) to update existing tables by matching primary keys and setting other columns as bulk-update values, allowing seamless installation with the newusers.main_access_key
column with split insert and update fixtures on theusers
table (#1785) - Implement the DDN storage backend with quota scope support (#1788)
- Add
vfolder_mounts
field to session field of client's output. (#1811) - Set timeout for Postgres Advisory lock. (#1826)
- Pass the root context to the manager plugins so that they can access database connection pools and other globals (#1829)
- Introduce
endpoint.created_user_email
andendpoint.session_owner_email
GQL field (#1831) - Change default to remove all volumes when execute delete-dev.sh and add "--skip-db" option to skip to remove volumes (#1852)
- Refactor the InvalidImageTag exception to include the full container image name for ease of debugging and error handling. (#1872)
- Add
pool-recycle
config to drop and replace timed-out connections. (#1877)
- Check whether a dependent session has not only succeeded but even terminated. (#1718)
- Minimize latency between session insertion and dependency insertion. (#1720)
- Restrict destroy of terminated sessions. (#1721)
- Improve the installer to use a new default wsproxy port for better compatibility with WSL (#1722)
- Fix the installer to use the refactored
common.docker.get_docker_connector()
for system docker detection which now also detects the active docker context if configured (#1724) - Make root partition filesystem type detection compatible with macOS using psutil (#1728)
- Fix additional installer issues found in a relatively fresher macOS instance (#1731)
- Fix an installer regression in #1724 to inappropriately cache an aiohttp connector instance used to access the local Docker API (#1732)
- Change the Redis port number in the webserver conf for
install_dev.sh
installation. (#1736) - Do null-check
rate_limit
value when validate user's rate limit. (#1738) - Fix some trafaret type checkers of redis config from
Float()
toToFloat()
. (#1741) - Update the open source version of the CUDA plugin to work with latest NVIDIA container runtimes (#1755)
- Change type name of AsyncNode to Node since React's Relay compiler use it to determine relay node. (#1757)
- Initialize the
_health_check_task
attribute of the kernel runner explicitly toNone
for safe access. (#1764) - Remove the
containers
field, which is awkward in table formatsession list
output, from thesession list --format
item. (#1766) - Improve the E2E CLI-based integration tests to work better with multi-user scenarios and updated
undefined
handling of boolean options (#1778) - Add the missing /folder/recover endpoint. Delete the duplicate status field of vfolder. (#1781)
- Fix
modify_user
mutation not working (#1787) - Add a missing
ComputeSession.start_service()
functional API in the client SDK with documentation updates (#1789) - Embed webapp response middleware to parse typed response to
web.Response
. (#1804) - Update the default PATH where the
pants
executable is installed ininstall-dev.sh
(#1806) - Fix an issue in the
ModifyContainerRegistry
mutation where theurl
was not updating due to a key mismatch. (#1810) - Add
id
column and restore incorrectly dropped unique constraints to DB association tables. (#1818) - Exclude unallocated resources from kernel idle utilization checks. (#1820)
- Fix model service health checker reporting invalid healthy status (#1833)
- Fix model service endpoint not updated despite session spawned without error (#1835)
- Fix
vfolder_list
GQL query not returninguser_email
andgroups_name
field (#1837) - Fix mistakes on SQL queries in the manager's vfolder share API handler when checking target user's status and inconsistent where clauses in the vfolder ownership change API (#1850)
- Fix image rescan not working when scanning Harbor v1 registry (#1854)
- Fix double-count issue caused by keypairs belonging to multiple projects (#1869)
- Improve the resource slot validation logic during session creation and related error messages to display explicit slot names and values with an extra guide on the "shmem" mistake (#1871)
- Enqueue session with
use_host_network
field along the scaling_group to which the session belongs. (#1873) - Fix session not created with CentOS 7 based images (#1878)
- Bring
watcher.py
back to Backend.AI Agent wheel (#1880) - Fix inconsistent event names reported when making event source channels for already-completed bgtasks (background tasks), which has caused a stale progress bar UI lingering for bgtask operations that finished too quickly (#1886)
- Refine and elaborate the Concepts section to reflect all the new features and concepts added in last 3 years (#1468)
- Update Backend.AI production installation guide doc (#1796)
- Update the Python development tool versions and restyle the codebase with updated Ruff (0.1.7), replacing Black with Ruff (#1771)
- Replace all usage of
log.warn()
tolog.warning()
since it is now deprecated (#1792) - Update aiohttp to 3.9.1 and workaround mypy
TCPConnector
ssl keyword argument type by add custom typeSSLContextType
(#1855) - Upgrade pantsbuild to 2.19.0 release (#1882)
- Add vfolder purge API for permanent vfolder removal and change original vfolder delete API to update vfolder status only. (#835)
- Support session-based usage stats for period. (#962)
- Add
max_vfolder_count
toProjectResourcePolicy
and migrate the same option toUserResourcePolicy
(#1417) - Refactor initiating logic of model session DB models so that errors while creating the session can be also stored and expressed to user (#1599)
- Check health status of model service actively (#1606)
- expose
max_ipu_devices_per_container
key toconfig.toml
(#1629) - Detailed Docker container creation failure log. (#1649)
- Allow privileged access for other's VFolder to superadmin (#1652)
- Add a
allow_app_download_panel
config to webserver to show/hide the webui app download panel on the summary page. (#1664) - Add a
allow_custom_resource_allocation
config to webserver to show/hide the custom allocation on the session launcher. (#1666) - Allow explicit
null
and empty string to ContainerRegistry mutations. (#1670) - Add
proxy
andname
fields toStorageVolume
graphene object. (#1675) - Add pex + scie based single-file self-contained self-bootstrapping bindary distributions that can be executed on any modern Linux/macOS machines using the standalone Python builds (thanks to @sureshjoshi) (#1680)
- Add the --output option to the function that outputs gql and openapi and unify the output format. (#1691)
- Add
pipeline.frontend-endpoint
andpipeline.hide-side-menu-button
configs to webserver. (#1692) - Add a community installer to replace and upgrade
install-dev.sh
, providing a full GUI experience in terminals (#1694) - Add option to control maximum number of NPU per session (#1696)
- Limit the size of the scratch directory by using loop mounted sparse file (#1704)
- webserver: Include the feature flag
service.is_directory_size_visible
in/config.toml
which provides an option whether to show/hide directory size in folder explorer (#1710) - Implement per-image metadata sync in the
mgr image rescan
command and deprecate scanning a whole Docker Hub account to avoid the API rate limit (#1712)
- Upgrade Graphene and GraphQL core (v2 -> v3) for better support of Relay, security rules, and other improvements (#1632)
- Use the explicit
graphql.Undefined
value to fill the unspecified fields of GraphQL mutation input objects (#1674)
- Wrong exception handling logics of
SchedulerDispatcher
(#1401) - Use "m" as the default suffix if not specified in the resource slots when creating sessions via the client CLI (#1518)
- Update the default API endpoint of the client SDK (
api.cloud.backend.ai
) (#1610) - Fix the mock accelerator plugin to properly set the environment variables without removing existing ones such as
LOCAL_USER_ID
. Also add explicit logging and warning about such situations. (#1612) - Clean up
entrypoint.sh
(our custom container entrypoint), including fixes to avoid non-mandatory recursive file operations on/home/work
(#1613) - Update GPFS storage client for compatibility. (#1616)
- Enable exhaustive search for recursive session termination irrelevant to each session's status (#1617)
- Remove legacy name-based container image exclusion filter to prevent unexpected exclusion of user-built images with names containing "base-" or "common" (#1619)
- Improve logging when retrying redis connections during failover and use explicit names for all redis connection pools (#1620)
- Allow sessions to have dependencies on stale sessions during the
_post_enqueue()
process. (#1624) - Mask sensitive fields when reading the container registry information via the manager GraphQL API (#1627)
- Use
ContainerRegistry.hostname
as ID to provide an unique identifier for each GraphQL node. (#1631) - Allow admins to restart other's session by setting an optional parameter
owner_access_key
. (#1635) - Unify each
project
field in GraphQL typesContainerRegistry
to be a list of string. (#1636) - Update GPFS storage client's Quota API parameters and queries. (#1637)
- Lower limit of maximum available characters to name of model service to fix model service session refuses to be created when service name is longer than 28 characters (#1642)
- To resolve the type mismatch between DB and schema, changed all schema types of
max_vfolder_count
to int. (#1643) - Set deprecation message to
max_vfolder_size
graphene field. Setmax_vfolder_count
andmax_quota_scope_size
graphene fields optional. Update VFolder update API to use renewed column namemax_quota_scope_size
. (#1644) - Handle
None
value of newly created Docker container's port. (#1645) - Move database accessing code to context manager scope (#1651)
- Replace the manager's shared redis config with the common's redis config, as this update is missed in #1586 (#1653)
- Revert #1652 (#1656)
- Fix
backend.ai-agent
package not recognizingbackend.ai-kernel
as dependency when building python package (#1660) - Fix symbolic link loop error of vfolder (#1665)
- Fix
execute_with_retry()
not retrying when DB commit has failed due to incorrect exception handling (#1667) - Update the parameter of session-template update API to follow-up change of session-template create API. (#1668)
- Fix infinite loop when malformed symbolic link exists in container (#1673)
- Restore removed graphene fields of resource policies and set them deprecated. (#1677)
- Allow running manager CLI commands without having
manager.toml
when they do not need it (#1686) - Update all functional wrappers of the Client SDK, CLI commands, and the counterpart Manager GraphQL mutations to distinguish undefined fields and deliberately-set-to-null fields (#1688)
- Allow
Undefined
value ofModifyGroupInput.user_update_mode
field to enable client-py updates group. (#1698) - Handle error in storage proxy's API error handler. (#1701)
- Add missing resource-usage fields. (#1707)
- Allow empty
auto_terminate_abusing_kernel
field from agent heartbeat. (#1715)
- Append new design aligns with revamped backend.ai webpage (#1690)
- Append Heading hierarchy font sizes & flyout menu for selecting en and kr. (#1702)
- Change fonts to webfonts and erase local font files. (#1714)
- Bump base Python version from 3.11.4 to 3.11.6 to resolve potential bugs. (#1603)
- Include
HOME
env-var when running tests via pants (#1676)
This is a test build for the community installer tests.
This is a test build for the community installer tests.
This is a test build for the community installer tests.
This is a test build for the community installer tests.
This is a test build for the community installer tests.
- Add option for roundrobin agent selection strategy (#1405)
- Add health check and manual trigger API for the manager scheduler (#1444)
- Implement VAST storage backend. (#1577)
- Apply the jinja
string
filter to ayarl.URL()
-typed field in webserver.conf to make it serializable (#1595)
- Add a GraphQL query to get the information of a virtual folder by ID. (#432)
- Implement limitation of the number of containers per agent. (#1338)
- Introduce the k8s agent backend mode to
install-dev.sh
with--agent-backend
option (#1526) - Improve the resource metadata API (
/config/resource-slots/details
) to include only explicitly reported resource slots and be able to filter by the agent availability in a resource group (#1589)
- Enable
ResourceSlotColumn
to returnNone
since we need to distinguish between emptyResourceSlot
value andNone
. Alterkernels.requested_slots
column into not nullable since the value of the column should not be null. (#1469) - Update outdated nfs mount for kubernetes agent backend (#1527)
- Collect orphan routings (route which its belonging session is already terminated) (#1590)
- Handle external error of storage proxy to return error response with detail message rather than just leaving it. (#1591)
- Add
pipeline.endpoint
default value toconfigs/webserver/halfstack.conf
to be able to run immediately after install (#1592) - Make
RedisHelperConfig
optional and give default values when it is not specified. (#1593)
- Fix webserver not working (#1588)
- Implement optional encryption of manager-to-agent RPC channels via CURVE asymmetric keypairs using updated Callosum (#887)
- Feature to enable/disable passwordless sudo for a user (work account) inside a compute session (#1530)
- Use
session.max_age
from webserver.conf to set the expiration for the pipeline authentication token (#1556) - Add new config directive
agent.advertised-rpc-addr
under agent.toml so that agent can be operated under NAT situation (#1575) - Add pipeline option to the
config.toml.j2
so that webui can access it. (#1576)
- Fix sentinel connection pool usage and improve Redis sentinel support (#1513)
- Fix a mismatch of the list of session status in the CLI and the manager (e.g., missing
PULLING
in the CLI) (#1557) - Let
RedisLock
retry until it acquires lock. (#1559) - Fix vFolder removal failing due to repeated type casting (#1561)
- Let agents skip mount/umount task and just produce task succeeded event rather than just return. (#1570)
- Resolve
last_used
field ofKeyPair
Gql object from Redis. (#1571) - Fix vFolder bulk deletion to finish any successful deletion task in a bulk. (#1579)
- Fix duplicate logger initialization when using
mgr start-server
command and let the CLI commands to use local logger without relaying log records via ZMQ sockets (#1581) - Remove rows corresponding to
vfolders
not found by storage proxy. (#1582) - Handle redis
LockError
when release. (#1583) - Add new alembic migration to remove mismatches between software defined schema and actual DB schema (#1584)
- Upgrade alembic (1.8.1 -> 1.12.0) to add
alembic check
command for ease of database branch/schema management (#1585)
- Add support for handling OpenID Connect authentication responses (#1545)
- Update both agent socket listener and metadata server to let container check what kind of sandbox it is being executed (#1549)
- Periodically scan and update available slots of all compute plugins (#1551)
- Add
task_name
to allGlobalTimer._tick_task
for better debugging. (#1553)
- Fix storage proxy watcher process always being started event if it is not enabled (#1547)
- Fix install-dev.sh failing to run when both trying to run the script from main branch and Node.js is not installed on the system (#1548)
- Let
RedisLock
raiseLockError
when it fails to acquire a lock instead of skipping lock. (#1554) - Fix metadata server not started with given port number (#1555)
- Hotfix: DB migration failing (#1544)
- Preserve the GlobalTimer tick termination logs in task monitoring. (#1541)
- Fix service not created when trying to use name of already destroyed one (#1539)
- Fix token not stored on database when character count of the token is greater than 1024 (#1540)
- Bump the manager API version to v7.20230615, as it includes a breaking change for quota management APIs (#1375)
- Automate force-termination of hanging sessions, which have been stuck in
PREPARING
orTERMINATING
status for a long period (#670) - Implement
container_pid_to_host_pid()
function (#955) - Add
project
field to Keypair graphene object and cmd, update minilang to query multiple rows from joined tables in one aggregated value. (#1022) - Use case-insensitive matching when applying the query filter for enum-based fields (#1036)
- Introduce the vfolder structure v3 to handle per-user/per-project quota in a more sensible and compatible way (#1191)
- Use
zsh
as the default shell with minimal configs, but including smart auto-completion, when the binary is available in a kernel image. (#1267) - Add basic support for model service (#1278)
- Add failure reason to the CLI login process in case of login failure. (#1305)
- Implement Dummy agent for easy integration test. (#1313)
- upgrade miniling to filter and order by JSON column. (#1334)
- Enable to filter and order by agent id when listing sessions. (#1337)
- Print migration steps as shell script instead of executing migration directly from python script (#1345)
- Issue a signed token to X-BackendAI-SSO header to authorize an user from the pipeline service (#1350)
- Add new GraphQL queries and mutations which can manipulate vFolder quota scope (#1354)
- Add a
directory_based_usage
config on webserver to show/hide Capacity column in each directory in data & storage page in Client-side. (#1364) - Add the OOM event and the details of potentially affected processes explicitly to the container logs for easier inspection for both users and admins (#1373)
- Improve backward compatibility for filtering and querying the agent IDs assigned for a comptue session in the GraphQL API (#1382)
- Add
OptionalType
class as a new parameter type wrapper, allowing the client CLI to manage arguments of theundefined
type. (#1393) - Add more agent selection scheduling strategies (#1394)
- Refactor
SessionRow
ORM queries by introducingKernelLoadingStrategy
to generalize and reuseSessionRow.get_session()
(#1396) - Update the open-source version of CUDA plugin to use CUDA 12.0, 12.1, and 12.2 versions and add missing pretty string representation of CUDA device objects (#1419)
- Add a status-check handler to the storage-proxy's client-facing API (#1430)
- Add new GraphQL queries and CLI commands to support paginated vfolder listing (#1437)
- Support setting the
wsproxy_addr
andwsproxy_api_token
option of scaling group in the client-py. (#1460) - Add manager redis ping command:
./backend.ai mgr redis ping
(#1462) - implement basic
ping_kernel()
API on agent side. (#1467) - Improve logging when the agent fails to allocate resource slots (#1472)
- Add a
max_count_for_preopen_ports
config on webserver to limit the number of sessionpreopen_ports
settings. (#1477) - Allow token login with body parameters, along with previous cookie-based way, by passing body to Manager's authorize handler. (#1478)
- Add support for displaying
preopen_ports
when executingsession info
CLI command. (#1479) - Implement a storage backend that works with a specific proxy API server in Openstack Manila. (#1480)
-
- Update storage proxy to be also eligible as an event producer / dispatcher
- Add event dispatcher at agent (#1481)
- Reduce the start-up delay for inference session containers by deferring the initial health check (#1488)
- Enable to mount volumes on agents and storage proxies through events. Remove kmanila storage backend as it has been migrated to plugins. Implement a storage proxy watcher that is delegated root privileges and executes privileged tasks. (#1495)
- Set the sleep argument of
AsyncRedisLock
to preevnt flooding the Redis server due to a high rate of polling requests (#1501) - Add GraphQL queries to track down generated endpoint tokens (#1509)
- Add a simple storage backend plugin interface to retrieve volume classes from separately install packages (#1516)
- Update
ContainerRegistry
-related mutations to respond with affected node (#1521) - Change to allow webserver to save logs to a file, similar to manager and agents. (#1528)
- Add session show-graph command to visualize session dependencies (#1532)
- Store timestamp of user's last API call date in Unix epoch format on redis (#1533)
- Add filters and touch up on vfolder sharing fail
- Add
is_active
filter on querying from keypair when sharing both user and group(project) vfolder - Touch-up message about handling group folder sharing results to display the failed account list properly. (#1204)
- Add
- Handle buggy ORM field loading when destroy session. (#1312)
- Use a more sensible value for the warning threshold for the number of concurrent generic/read-only transactions within a manager process (#1320)
- Check
None
value of config argument'sresources
key when enqueue session. (#1322) - Fix to check type of
agent_id
strictly when schedule multi-node session. (#1325) - Set session status
PULLING
when any sibling kernel is pulling image. (#1326) - Fix agent refusing to send heartbeat when
public-host
is set (#1332) - Fix some of manager's vFolder API raising error (#1333)
- Update storage proxy's
list_files()
API to only scan files in current directory, instead of scanning recursively (#1335) - Fix vFolder v3 migration script failing (#1336)
- Fix agent not reading available krunner volumes when host's docker has untagged image (#1341)
-
- Resolve regression which
ComputeSessionList
GraphQL query raises HTTP 400 error due to missing conversion of VFolder IDs in the mount history after introduction of Quota Scope IDs, by trying to update kernels and sessions table with appropriate quota scope ID - Update VFolderID validator to also allow null vFolder ID in case of older session data with unknown quota scope ID (#1343)
- Resolve regression which
- Return None for
sessions.status_changed
whensessions.status_history
is None (#1344) - Prevent scanning every sub-directories for listing vfolder files for requests with non-
recursive
option. (#1355) - Enhance vfolder v3 directory migration script. (#1357)
- Add
groups_name
aggregated field in querying keypairs by email or access key to prevent field reference error. (#1358) - Removing trailling comma from container's
service-ports
label. (#1359) - Fix
get_fs_usage()
API reporting capacity as usage and usage as capacity on GPFS and Weka backend (#1376) - Enable to order or filter by
image
when list sessions. (#1378) - Finalize per-kernel scheduling results using the correct kernel IDs. (#1380)
- Avoid returning
NaN
values with undefined capacity and percentage values to prevent calculation errors but just set them zeros. (#1385) - Add
session_name
to aliased key ofsession_name
(#1395) - Allow projcet vfolder creation regardless of the user (keypair) vfolder count limit (#1397)
- Prevent creating/cloning vfolders with duplicate names on different hosts by deleting conditions checking host. (#1398)
- Fix redundant vfolder creation while cloning and avoid checking
max_vfolder_count
when the admin has requested cloning of project type vfolders (#1400) - Fix getting psutil.Process synchronously for catching psutil.NoSuchProcess error leak (#1408)
- Enable transit session status from
PULLING
toCANCELLED
orTERMINATED
. (#1412) - Make the parsing routine of PostgreSQL version strings more robust with additional build tags (#1415)
- Allow storing an empty string (list) in the
project
field of container registry configurations for better compatibility with the GUI behavior and share the same input validation logic in both manager configuration loader andset_config
API (#1422) - Allow termination of a compute session even when the configured wsproxy address is invalid or inaccessible (#1423)
- Update
concurrency_used
by scanning the Redis fully when there is noSession
data. (#1429) - Add shell script codes to setup
version.txt
including vfolder version ininstall-dev.sh
. (#1438) - Ensure the interpretation of the
project
field to be a list when adding/updating container registries, even with empty strings (#1447) - Support CRUD API for container registry using graphQL to deprecate the raw etcd access API from backend.AI WebUI (#1450)
- Add None check to out of scoped variable for correct error response to user. (#1464)
- Add the mininum page size check when paginating in the client CLI (#1465)
- Fix a regression that client-set environment variables were not properly passed to the session containers (#1470)
- Update outdated distro selection algorithm of Kubernetes agent backend (#1474)
- Provides improved logging of delete operations. (#1490)
- Correct null check when migrate
role
column inkernels
table. (#1500) - Fix a regression of unpickling code runner objects when restoring the last-saved kernel registry while restarting the agents (#1502)
- Separate consumer groups of event dispatcher for each service to not intercept other service's event. (#1503)
- Fix drifting of the agent allocation maps due to missing rollback mechanism when there is an allocation failure (e.g.,
InsufficientResource
) (#1510) - Add missing update of the etcd port in
storage-proxy.toml
by theinstall-dev.sh
script (#1514) - Enforce the VFolder
delete_by_id()
handler to validateid
parameter to be an UUID (#1517) -
- Remove rows of sessions table associated with user to purge along with records under tables (error_logs, endpoints) which has foreign key constraint to
sessions.id
- Fix buggy user vfolder fetching query when purging user (#1531)
- Remove rows of sessions table associated with user to purge along with records under tables (error_logs, endpoints) which has foreign key constraint to
- Fix invalid redis key being set when rescanning resource usage (#1534)
- Fix Internal server error (500) raised on situations when Method not allowed (405) should be returned (#1535)
- Improve formatting and trafaret compatibility error reporting of the OpenAPI-based Manager REST API reference (#1452)
- Add predicate-checking plugin hook to enable validate resource request by plugin. (#1454)
- Update the environment setting command in
development-setup
document for verifying the installation (#1463)
- Update etcetra version to 0.1.17 (#1537)
- Due to reduced readability due to numerous decorators, duplicate decorators are integrated and managed, and related modules are separated into
session
subpackages. (#537) - Bump the base Python version from 3.11.3 to 3.11.4 to resolve potential upstream bugs (#1431)
- Auto-enable
--editable-webui
option when runninginstall-dev.sh
from the main branch to ensure the latest version of it (#1441) - Add
--show-guide
option toinstall-dev.sh
for redisplaying the after-setup instructions (#1442) - Replaced Flake8 and isort with Ruff for faster linting and formatting (#1475)