
Agent Spawner daemon #125

Merged: praiskup merged 3 commits into main on Nov 6, 2023

Conversation

@praiskup (Owner) commented Sep 6, 2023

Relates to #123

@praiskup praiskup marked this pull request as draft September 6, 2023 11:13
agentspawner/daemon.py: Fixed
@praiskup praiskup mentioned this pull request Sep 6, 2023
agentspawner/daemon.py: Fixed
continue

self.log.debug("Closing ticket %s", ticket.id)
ticket.close()

Owner Author

@siteshwar Note that we close the ticket once the call_release() hook finishes with exit status 0
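
For context, a minimal sketch of that pattern (not the PR's exact code; the `./hook-release` path and function names are illustrative):

```python
import subprocess

def call_release(data):
    """Run the configured release hook; success means exit status 0."""
    proc = subprocess.run(["./hook-release", str(data)], check=False)
    return proc.returncode == 0

def release_and_close(ticket, data):
    # Close the ticket only once the release hook has succeeded;
    # otherwise keep it open and retry on the next daemon iteration.
    if call_release(data):
        ticket.close()
```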

ticket = self.conn.newTicket(self.tags)
self.log.debug("Taking ticket id %s", ticket.id)
self.tickets.append(ticket.id)
data = ticket.wait()

Contributor

This would cause blocking in linear mode. Can we open a ticket and wait in a separate thread (or process)?

Owner Author

This is true, though resalloc prepares the resources in advance with the "preallocation" mechanism, so doing this in parallel isn't worth the initial hassle IMO. Simply put, the wait() will return immediately (or at least the subsequent wait() will). WDYT?

Contributor

Ok.

self.log.debug("Taking ticket id %s", ticket.id)
self.tickets.append(ticket.id)
data = ticket.wait()
self.call_take(data)

Contributor

I would also need a post_call_take() callback to start the osh worker on the machine.

Contributor

Or probably I can merge my changes into one and just use the call_take() hook.

@siteshwar (Contributor)

The process of starting a new osh worker is as follows:

  • Request a new machine.
  • Wait for it to appear and get its hostname.
  • Generate an osh worker configuration based on the hostname.
  • Transfer the configuration to the new osh worker.
  • Start the osh worker daemon.

The last 3 steps should be done after the hostname is known and probably belong to a separate post_call_take() hook.
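
A rough sketch of what such a hook could look like (the hook's shape, the paths, and the osh-worker service name are illustrative, not part of this PR):

```python
#!/usr/bin/env python3
"""Hypothetical post-take hook: configure and start an osh worker."""
import subprocess
import sys

def main():
    # The hostname is expected as the first argument (taken from ticket data).
    hostname = sys.argv[1]

    # Generate an osh worker configuration based on the hostname.
    config = (
        "[worker]\n"
        "hub_url = https://osh-hub.example.com/xmlrpc\n"
        f"worker_name = {hostname}\n"
    )
    with open("/tmp/osh-worker.conf", "w", encoding="utf-8") as out:
        out.write(config)

    # Transfer the configuration to the new worker.
    subprocess.run(
        ["scp", "/tmp/osh-worker.conf", f"root@{hostname}:/etc/osh/worker.conf"],
        check=True)

    # Start the osh worker daemon remotely.
    subprocess.run(
        ["ssh", f"root@{hostname}", "systemctl", "start", "osh-worker"],
        check=True)

if __name__ == "__main__":
    main()
```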

@praiskup (Owner, Author)

The last 3 steps should be done after the hostname is known and probably belong to a separate post_call_take() hook.

We can have two hooks, but if they can be logically merged, the API would be simpler.

sleep = 30

def __init__(self, resalloc_connection, logger):
self.tags = ["A"]

Contributor

This should be customizable through configuration.

if todo > 0:
self.start(todo)
elif todo < 0:
self.try_to_stop(-todo)

Contributor

There should be a periodic check here to find virtual machines that cannot be used anymore and should be deleted.

Owner Author

Can you elaborate on this one? The try_to_stop hook actually does this?

Contributor

The try_to_stop hook tries to close tickets based on a number, so if the number does not change, the tickets would not be closed. We would like to have throwaway workers in openscanhub, that is, a worker should be deleted once a task assigned to it has been completed. This would require querying the database, and it should be done through a separate hook.

On a side note, we should regularly perform health checks on the workers, and if a worker ends up in an unusable state, its related ticket should be closed.

Owner Author

Good point. Discussed on Slack 1:1: the ticket needs to be closed for finished agents, because the agents cannot close the ticket themselves. We could rely on the "FAILED" ticket state (if the VM/resource goes down by itself), but that's not ideal for OSH.

Contributor

If an agent goes down by itself and the ticket is in the failed state, there should be a separate hook to remove the agent from the osh (kobo) database.

def __init__(self, resalloc_connection, logger):
self.tags = ["A"]
# TODO: use a persistent storage so we can restart the process
self.tickets = []

Contributor

The tickets being processed by the daemon should be stored in the database to survive service restarts.

Owner Author

Can we dump the list of tickets into a persistent "text" file for now?

Contributor

It would be fine for the beginning, but I plan to use containers for deployments, so this would have to be fixed using a database before that.

Owner Author

Yes, OK -> for the beginning we can have a persistent volume or a bind-mounted file.
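
A minimal sketch of that interim solution, assuming a bind-mounted path such as /var/lib/agentspawner/tickets.json (the path is illustrative):

```python
import json
import os

TICKETS_FILE = "/var/lib/agentspawner/tickets.json"  # illustrative path

def save_tickets(ticket_ids):
    """Dump the open ticket IDs so a restarted daemon can pick them up."""
    tmp = TICKETS_FILE + ".tmp"
    with open(tmp, "w", encoding="utf-8") as out:
        json.dump(list(ticket_ids), out)
    os.replace(tmp, TICKETS_FILE)  # atomic swap avoids partial writes

def load_tickets():
    """Return previously stored ticket IDs, or an empty list on first start."""
    try:
        with open(TICKETS_FILE, encoding="utf-8") as inp:
            return json.load(inp)
    except FileNotFoundError:
        return []
```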

def call_converge_to(self):
""" Execute the configured hook script """
while True:
result = subprocess.run(["./hook-converge-to"], capture_output=True,

Contributor

I had to replace capture_output=True with stdout=subprocess.PIPE to get this script working on a RHEL 8 machine.

Owner Author

Ah, Python 3.7+, thank you for the hint.
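
For reference, capture_output= was added to subprocess.run() in Python 3.7, while RHEL 8 defaults to Python 3.6; a version-agnostic variant of the call could look like this (sketch only):

```python
import subprocess

# Works on Python 3.6 (RHEL 8): pass the pipes explicitly instead of
# capture_output=True, and use universal_newlines= instead of text=.
result = subprocess.run(["./hook-converge-to"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        universal_newlines=True,
                        check=False)
wanted_agents = int(result.stdout.strip())
```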

"""
Call hook that prepares the resource
"""
return not subprocess.run(["./hook-take", f"{data}"], check=True)

Contributor

The hooks should be customizable through a configuration file.
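
One possible shape for that (a sketch only; the config path and the helper names are assumptions, loosely following the YAML example that shows up later in this thread):

```python
import os
import subprocess
import yaml

CONFIG_PATH = "/etc/resalloc-agent-spawner/config.yaml"  # assumed location

def load_group(name="osh_workers"):
    """Load the hook commands configured for one agent group."""
    with open(CONFIG_PATH, encoding="utf-8") as cfg:
        return yaml.safe_load(cfg)["agent_groups"][name]

def run_hook(group, key, data=None):
    """Run a configured hook (e.g. cmd_prepare); success == exit status 0."""
    env = dict(os.environ)
    if data is not None:
        env["TICKET_DATA"] = str(data)  # hypothetical way to pass ticket data
    proc = subprocess.run(group[key], shell=True, env=env, check=False)
    return proc.returncode == 0
```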

"""
while True:
start = time.time()
todo = self.call_converge_to() - len(self.tickets)

Contributor

We probably need a check here so that the number of opened tickets does not exceed the maximum number of machines allowed in a pool.

Owner Author

The set of resources may span several pools, so this is a hard task to do ... can we implement this later?

Contributor

Yes.

Contributor

If the converge-to hook returns a number greater than the number of running machines, the recycle hook would never be called and all the running machines would be in an unusable state.

Owner Author

The self.recycle() is called every time.

Contributor

This code synchronously waits for tickets, so if the converge-to hook returns a number greater than the number of running nodes, the recycle hook would never be called.

Owner Author

Indeed -> it is better to perform the preparation asynchronously. Noted. I think I'm going to start using the python-copr-common logic for the dispatcher/background workers, which depends on Redis; is that OK? While depending on Redis, I think we can save the list of opened tickets there, too.

Contributor

@kdudka The hub and the resalloc spawner daemon should be running on the same node. Would it cause any problems to have a dependency on Redis?

Owner Author

Redis can be hosted in a different container. We just need the runtime dependency on the Redis client (python3-redis).
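
A sketch of what keeping the open-ticket set in Redis could look like (the key name and connection details are illustrative; only python3-redis is needed on the client side):

```python
import redis

# Redis itself may run in a separate container; connect over the network.
conn = redis.Redis(host="redis", port=6379, decode_responses=True)
TICKETS_KEY = "agent-spawner:tickets"  # illustrative key name

def remember_ticket(ticket_id):
    conn.sadd(TICKETS_KEY, ticket_id)

def forget_ticket(ticket_id):
    conn.srem(TICKETS_KEY, ticket_id)

def known_tickets():
    return {int(tid) for tid in conn.smembers(TICKETS_KEY)}
```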

self._drop_ticket(ticket)
stopped += 1

def recyle(self):

Contributor

There is a typo here and it should be renamed to recycle.

Ticket,
)

RESALLOC_SERVER = "http://localhost:49100"

Contributor

This should be customizable through configuration.


for ticket_id in list(self.tickets):
ticket = Ticket(ticket_id, connection=self.conn)
data = ticket.collect()

Contributor

What would this call return?

Contributor

Shall we pass ticket.output to the throwaway hook? I would expect to get the IP address of the machine from there.

Owner Author

It is the data printed on stdout by cmd_alloc in resalloc. Whatever you dump there, you get here.
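
In other words, the allocation script's stdout is passed through verbatim; a hook can parse it however it likes, e.g. (sketch only, the JSON shape is just an assumption):

```python
import json
import sys

def parse_ticket_data(raw):
    """Parse whatever the allocation command printed to stdout.

    If it printed JSON like {"host": "1.2.3.4"}, use it as-is;
    otherwise treat the whole output as a bare hostname/IP.
    """
    raw = raw.strip()
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:
        pass
    return {"host": raw}

if __name__ == "__main__":
    print(parse_ticket_data(sys.stdin.read()))
```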


ticket = Ticket(ticket_id, connection=self.conn)
data = ticket.collect()
if not self.call_release(data):

Contributor

We should pass ticket.output here.

Owner Author

Ah, OK -> if it is ticket.output then yes.

if not self.call_throwaway(data):
continue

self._drop_ticket(ticket)

Contributor

Shall we call the release hook here? Should the throwaway hook only check whether a node can be thrown away, or delete it too? That may cause code duplication.

Owner Author

I don't think we want to call the release hook in two places. If something needs to be done in both places, you can appropriately tweak both the configurable release and throwaway hooks.

Contributor

What is the difference between release and throwaway hooks? They seem to be doing the same thing under different names.

Owner Author

It should actually be named try_release vs. throwaway, or something like that.

The first one removes even workers that have not yet started any tasks, while throwaway removes workers that have already finished at least one task.

@github-advanced-security (bot) left a comment

vcs-diff-lint found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

resalloc_agent_spawner/dispatcher.py: Fixed (6 alerts)
resalloc_agent_spawner/helpers.py: Fixed (1 alert)
resalloc_agent_spawner/worker.py: Fixed (2 alerts)
@praiskup praiskup force-pushed the praiskup-agent-spawner branch 3 times, most recently from b7dc13d to 172001e on October 24, 2023 00:38
There doesn't seem to be an ideal way to handle exceptions in XMLRPC.
The previous code just raised a Fault() error, which we mishandled on the
client side with the `survive_server_restart` feature (a non-existing
ticket id led to indefinite retries).
@praiskup praiskup marked this pull request as ready for review October 27, 2023 01:10
resalloc/client.py: Fixed
@praiskup (Owner, Author)

@siteshwar what do you think about the current variant? It seems to be working
with this configuration:

agent_groups:
  osh_workers:
    cmd_prepare: /bin/true
    cmd_terminate: echo noop
    cmd_converge_to: echo 5
    cmd_check_finished: exit $(( RANDOM % 15 ))
    cmd_try_release: exit $(( RANDOM % 5 ))
    tags:
      - kobo_worker

Of course, you must be running the resalloc server on a normal port and have some
pool defined with the kobo_worker tag.

@praiskup praiskup changed the title PoC: agent spawner Agent Spawner daemon Nov 6, 2023
from copr_common.log import setup_script_logger
from copr_common.redis_helpers import get_redis_connection

from resalloc.client import (

Check warning (Code scanning / vcs-diff-lint): No name 'client' in module 'resalloc'
import os
import subprocess

from resalloc.helpers import load_config_file

Check warning (Code scanning / vcs-diff-lint): No name 'helpers' in module 'resalloc'
@praiskup praiskup merged commit 871b629 into main Nov 6, 2023
2 of 3 checks passed
@praiskup praiskup deleted the praiskup-agent-spawner branch January 10, 2024 15:59