EPIC: Safe throttling for Out of Space, log overflow, small blob overflow at VDisk #12510

the-ancient-1 · 2024-12-11T13:31:50Z

Safely and predictably handle Out of Space situations, log overflow, and accumulation of excessive amounts of small blobs in the index.

VDisks under load should not enter a state where nothing can be done without involving the operations team armed with dstool. A tablet on a full disk should maintain the ability to delete data.

Work Plan (starting December 5)
2d - December 9

Learn to generate load that creates problematic situations (compaction in circles, many small blobs)

5d - December 16

Implement throttling of incoming PUT requests at VDisk, controlled through ICB #12515

3d - December 19

Determine good enough throttling parameters and thresholds #12651

3d - December 24

Create graphs, settings, logs about throttling and small blobs #12820

2d - December 26

Write RFC describing how and why operations will be throttled and halted, which monitoring graphs to observe, and share this RFC with the NBS team beforehand to prepare them for potential alerts and performance impacts #13084

4d - January 10

Write documentation about configuration

2d - January 14

Ensure alert appears for on-duty engineer / show NBS what to monitor

3d - January 17

Implement log overflow throttling - monitor log length, control throttler

3d - January 22

Create graphs, settings, logs about the log

4d - January 27

Write configuration documentation

3d - January 30

Implement throttling when approaching Out of Space - monitor remaining space, control throttler #13083

3d - February 4

Create graphs, settings, logs about out of space

4d - February 10

Write documentation about configuration

43d

Definition of Done:

A cluster user can fill up the cluster's disk space using only their application and independently resolve this situation by deleting data.
A cluster user can fill up the cluster with small blobs using only their application and will receive alerts and appropriate throttling.

the-ancient-1 · 2024-12-11T13:43:25Z

Below is a draft.
First subtask:

Writing a large number of small blobs (and compaction lag in general) should lead to throttling of small-blob writes; the fact that throttling is active should be clearly visible in graphs, the web interface, and logs.
Having too many small blobs should trigger an alert even before throttling starts.

When the number of gigabytes used to store small (inplaced) blobs (or index chunks?) exceeds the threshold values, the following are needed:

graphs: current value and thresholds
warning
alert
gradual slowdown of new small-blob writes, down to a complete stop; the rate can be limited with a gradient from the fair share of the device's model speed down to 0 between a pair of "fullness" points.
threshold settings and on/off switches for all of this, via ICB + CMS
an RFC is needed describing how and why work will be slowed down and stopped and which alerts to watch, so that we can show this RFC to the NBS team in advance and they are less surprised when their alerts fire and everything slows down and stops.
an alternative / additional implementation path: throttling the incoming write load on the log when high-level compaction cannot keep up and too many chunks produced by fresh compaction accumulate.
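
The gradient between a pair of "fullness" points mentioned above could look roughly like the sketch below. It is only an illustration in Python, not VDisk code: the function name, the parameter names, and the choice of a linear ramp are assumptions, and the fullness metric is taken to be gigabytes occupied by small blobs.

```python
def throttled_rate(fullness_gb, start_gb, stop_gb, fair_share_rate):
    """Return the allowed small-blob write rate for the current fullness.

    Hypothetical helper for illustration only.
    fullness_gb     -- gigabytes currently used by small blobs (assumed metric)
    start_gb        -- fullness at which throttling begins
    stop_gb         -- fullness at which writes stop completely
    fair_share_rate -- fair share of the device's model speed, in any rate unit
    """
    if not start_gb < stop_gb:
        raise ValueError("start_gb must be strictly below stop_gb")
    if fullness_gb <= start_gb:
        # Below the first point: no throttling, full fair share.
        return fair_share_rate
    if fullness_gb >= stop_gb:
        # At or beyond the second point: writes are stopped.
        return 0.0
    # Between the points: linear ramp from fair_share_rate down to 0.
    remaining = (stop_gb - fullness_gb) / (stop_gb - start_gb)
    return fair_share_rate * remaining


if __name__ == "__main__":
    for gb in (50, 100, 150, 200, 250):
        print(gb, throttled_rate(gb, start_gb=100, stop_gb=200, fair_share_rate=80.0))
```

A linear ramp is the simplest reading of "gradient"; the actual curve and the two points would presumably come from the threshold settings exposed through ICB + CMS.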

Second subtask:

Log overflow should trigger an alert even before throttling starts.
Log overflow should lead to throttling of the load on the VDisk, down to a complete stop.
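
The requirement that the alert fires before throttling amounts to an ordering constraint on the thresholds. A minimal Python sketch of that constraint follows; the class, field names, and state labels are hypothetical and only illustrate the idea, with "value" standing for whatever is monitored (for example, log length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical threshold set for one monitored value (e.g. log length)."""
    warning: float
    alert: float
    throttle_start: float
    throttle_stop: float

    def __post_init__(self):
        # The alert must come strictly before throttling begins,
        # and throttling must begin before the complete stop.
        if not (self.warning <= self.alert < self.throttle_start < self.throttle_stop):
            raise ValueError(
                "expected warning <= alert < throttle_start < throttle_stop"
            )


def classify(value, t):
    """Map the monitored value to a coarse state, checking the highest level first."""
    if value >= t.throttle_stop:
        return "stopped"
    if value >= t.throttle_start:
        return "throttling"
    if value >= t.alert:
        return "alert"
    if value >= t.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    limits = Thresholds(warning=50, alert=60, throttle_start=70, throttle_stop=90)
    for v in (10, 55, 65, 80, 95):
        print(v, classify(v, limits))
```

Validating the ordering when the settings are loaded means a misconfiguration that would throttle silently, without a preceding alert, is rejected up front.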

Third subtask:

Approaching Out of Space should lead to write throttling.
Groups turning yellow should not prevent tablets from starting or data from being deleted; this may require marking deletions additionally (?)

the-ancient-1 assigned alexd65536 Dec 11, 2024
