[DPE-4196] Plugin Management Refactor #435
base: 2/edge
Conversation
```
@@ -34,46 +34,25 @@ class OpenSearchKeystoreError(OpenSearchError):
     """Exception thrown when an opensearch keystore is invalid."""


 class OpenSearchKeystoreNotReadyYetError(OpenSearchKeystoreError):
     """Exception thrown when the keystore is not ready yet."""
```
This error is thrown when we try to reach out to the API to reload the keys and it fails.
Sync charm docs from https://discourse.charmhub.io Co-authored-by: a-velasco <andreia.velasco@canonical.com>
Currently, we are seeing a lot of timeouts in CA rotation testing. Splitting between small and large deployments and having parallel runners will help with the overall duration.
Huge amount of work on this PR. Nice!
I've been running this branch locally and haven't found any issues so far. Left some comments.
```python
class OpenSearchPluginDataProvider:
    """Implements the data provider for any charm-related data access.

    Plugins may have one or more relations tied to them. This abstract class
    enables different modules to implement a class that can specify which
    relations the plugin manager should listen to.
    """
```
question: Does this class include large deployment relation?
Not directly, but that is the idea. This class exists to abstract the access to the databags away from basic plugins. The plugin should be "dumb", i.e. just a very basic dataclass. This class provides the plugins with any relation info.
```
@@ -452,66 +503,141 @@ def name(self) -> str:
         return "opensearch-knn"


 class OpenSearchPluginBackupDataProvider(OpenSearchPluginDataProvider):
```
nit: I would personally remove most of the `OpenSearch` prefix on these classes, since they are somewhat redundant and create noise when parsing or searching the file.
```python
elif charm.opensearch_peer_cm.deployment_desc().typ == DeploymentType.MAIN_ORCHESTRATOR:
    # Using the deployment_desc() method instead of is_provider()
    # In both cases: (1) small deployments or (2) large deployments where this cluster is the
    # main orchestrator, we want to instantiate the OpenSearchBackup class.
    return OpenSearchBackup(charm)
```
question: shouldn't this one also return `OpenSearchBackup` when the type is `DeploymentType.FAILOVER_ORCHESTRATOR`?
We do not really care in this case; they all receive instructions via peer-cluster. If we do have a failover, then that cluster will run another hook later down the road, which will call the backup system and this time be the "OpenSearchBackup" class.
```diff
 raise OpenSearchPluginMissingConfigError(
-    "Plugin {} missing: {}".format(
+    "Plugin {} missing credentials".format(
```
This message is not informative enough. When integrating with s3, you can have both `access_key` and `secret_key` set on the s3-integrator and still get this message. Only after setting those, and adding e.g. `bucket`, do you get the more informative message from below that shows the missing fields.
So, that message continues there; check the `if` right after: `if self.dp.is_main_orchestrator:`. This IF ensures we only do the more thorough check if we are the main orchestrator. That makes sense, as the main orchestrator is the one that first interacts with the S3. The remaining clusters will only receive the data IF it has been passed via the peer relation, which will only happen if all the contents were present in the first place.
```python
    and self._charm.health.get()
    in [HealthColors.GREEN, HealthColors.YELLOW, HealthColors.IGNORE]
)

def is_ready_for_api(self) -> bool:
```
More of a design opinion here, but these checks should probably be centralized at some point. They are invoked from outside this file, which means they are relevant checks for other components.
@zmraul can you give more precise references? AFAIU, other places are asking whether opensearch is healthy; I am asking whether it is responsive.
I guess you mean the `ready_for_api`, right? This is actually a proxy method to check the health. So, if we received an answer and the cluster is minimally healthy, then we can proceed.
```python
except OpenSearchCmdError as e:
    if "not found" in str(e):
        logger.info(f"Plugin {plugin.name} to be deleted, not found. Continuing...")
        return False
    raise OpenSearchPluginRemoveError(plugin.name)
return True


def _clean_cache_if_needed(self):
```
nit: Not sure why this is needed. Clearing the cache on `self.plugins` will lead to `ConfigExposedPlugins` being read again, which is static, so there is no benefit to clearing it. `self.plugins` is being used as a read-only property as far as I can see. For `_installed_plugins`, I would make that a `@property` instead and just evaluate it every time.
True, makes sense. I will remove that method and manage `_installed_plugins` as a property. The real cache we need to watch for is the cluster_config, which is deleted somewhere else.
Removed this method.
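A small sketch of the suggestion discussed in this thread: expose the installed-plugin list as a property that is re-evaluated on every access, instead of a cached attribute that needs a manual cache-cleaning step. All names here are illustrative, not the charm's real API.

```python
class PluginManager:
    """Minimal sketch: installed plugins exposed as an always-fresh property."""

    def __init__(self, list_plugins_cmd):
        # list_plugins_cmd stands in for shelling out to `opensearch-plugin list`
        self._list_plugins_cmd = list_plugins_cmd

    @property
    def installed_plugins(self) -> list:
        # Re-evaluated on every access: no stale cache to invalidate.
        return self._list_plugins_cmd()
```

Calling code reads `manager.installed_plugins` as before, but the value always reflects the current on-disk state, so no `_clean_cache_if_needed` is required.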
```python
def run(self) -> bool:
    """Runs a check on each plugin: install, execute config changes or remove.

    This method should be called at the config-changed event. Returns whether a restart is needed.
    """
    is_manager_ready = True
```
question: Shouldn't this method be gated by `is_ready_for_api`? It is calling API commands on `apply`, and it's called from base-charm.
Maybe this is more of a naming issue. The idea here was more `is_keystore_ready`.
Changed the name, check if that makes more sense.
Thanks Pedro. I left some comments.
It seems to me that we are making the plugin components too smart as opposed to the previous implementation, which I believe was clearer and had better separation of concerns.
The large deployments workflow needs to be carefully analyzed.
```python
protocol: Optional[str] = None
storage_class: Optional[str] = Field(alias="storage-class")
tls_ca_chain: Optional[str] = Field(alias="tls-ca-chain")
credentials: S3RelDataCredentials = Field(alias=S3_CREDENTIALS, default=S3RelDataCredentials())
```
can you explain the `default` here?
Missing configs from the s3-integrator
```
@@ -220,48 +226,50 @@ def __init__(self, charm: "OpenSearchBaseCharm", relation_name: str = PeerCluste
         ]:
             self.framework.observe(event, self._on_s3_relation_action)


     def _on_secret_changed(self, event: EventBase) -> None:
```
should be marked as abstract
Not really: we actually call this method when we do not yet know what deployment description we will get but we did get a secret-changed. In that case, we call it and it passes right away.
So this is not an abstract method.
```python
# Defaults to True if we have a failure, to avoid any actions due to
# intermittent connection issues.
logger.warning(
    "_is_restore_in_progress: failed to get indices status"
    " - assuming restore is in progress"
)
return True
```
I'm not sure I understand this path. Why return True and assume a restore is in progress when it may not be, instead of letting it crash?
I kept the exception handling so we can add an error log here. I will re-raise this exception, however.
```python
MANDATORY_CONFS = [
    "bucket",
    "endpoint",
    "region",
    "base_path",
    "protocol",
    "credentials",
]
```
similar comment on duplication: pydantic should already perform the required validation.
"protocol", | ||
"credentials", | ||
] | ||
DATA_PROVIDER = OpenSearchPluginBackupDataProvider |
why a class variable?
They are very tightly coupled. The idea is to have each subclass define its own provider type. To simplify things, I am stating a class variable so that the upstream classes creating the plugin do not need to worry about creating two objects instead of one.
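An illustrative sketch of the pattern described in this reply: each plugin subclass names its own data-provider type as a class variable, and the base class instantiates it, so the code that creates the plugin never has to build two objects. All class names here are stand-ins, not the charm's real classes.

```python
class BaseDataProvider:
    """Stand-in for a generic plugin data provider."""


class BackupDataProvider(BaseDataProvider):
    """Stand-in for a backup-specific data provider."""


class BasePlugin:
    # Subclasses override this to pick their own provider type.
    DATA_PROVIDER = BaseDataProvider

    def __init__(self):
        # The plugin builds its provider from the class variable, so callers
        # only ever instantiate the plugin itself.
        self.provider = self.DATA_PROVIDER()


class BackupPlugin(BasePlugin):
    DATA_PROVIDER = BackupDataProvider
```

With this shape, `BackupPlugin()` automatically carries a `BackupDataProvider`, keeping the coupling in one declarative line.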
""" | ||
|
||
MODEL = S3RelData |
why a class variable?
```python
self.charm.secrets.put_object(
    Scope.APP,
    S3_CREDENTIALS,
    S3RelDataCredentials().to_dict(by_alias=True),
)
```
can you explain why?
This is the case where we are missing the s3 information; we just create an empty object to signal that.
```python
    )
)

if self.dp.is_main_orchestrator:
```
this is too smart for the plugins, which are supposed to be dumb. The heavy lifting should have been in the data provider, which should only inject into the plugin what it needs.
Yeah, I agree. But the thing here is that I need some "factory" method that differs between MAIN and OTHERS...
There were several changes to the config-changed logic, which made this rebase rather complex. I am making a PR to the main refactor branch so we can look at it more carefully before having it all together.
- If OpenSearch is throttling, this is an alert that optimizations are necessary, such as scaling the number of nodes or changing queries and indexing patterns
The main goal is to: (1) separate API call (/_cluster) between plugin-specific calls and cluster settings; (2) curb the requirements for restart for any plugin changes; (3) get a faster response time using cached entries wherever possible; (4) add new use-case with secret handling within the plugin logic itself; and (5) define models so we can standardize the plugin data exchanged between the objects and via relation.
Dev experience
The idea is to make it easier to add management for separated plugins without impacting the rest of the code. A dev willing to add a new plugin must decide: do we need to manage a relation or just config options on the charm?
If not, then we can add a config-only plugin
If yes, then we will need a new object to encapsulate the plugin handling: the "DataProvider"
The config-only plugin
These are plugins configured via config options. In this case, it is only necessary to add a new OpenSearchPlugin child class that manages the config options to be added to or removed from the cluster.
For example, opensearch-knn receives the config from the charm and returns the options to be set in the opensearch.yml.
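A hedged sketch of what a config-only plugin could look like, in the spirit of the opensearch-knn example above. The class name, config-option key, and yaml key are assumptions for illustration, not the charm's actual implementation.

```python
class KnnPlugin:
    """Config-only plugin sketch: maps one charm config option to opensearch.yml entries."""

    def __init__(self, charm_config: dict):
        # Hypothetical charm option name; defaults to disabled if absent.
        self._enabled = bool(charm_config.get("plugin_opensearch_knn", False))

    def config(self) -> dict:
        # Entries the plugin manager would write into opensearch.yml.
        return {"knn.plugin.enabled": str(self._enabled).lower()}
```

The plugin stays "dumb": it only translates charm config into yaml entries, and the plugin manager handles applying them.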
The relation-based plugins
These plugins are more elaborate, as they have to process events specific to the given plugin. We also must consider the case of large deployments, where data may come via dedicated relation or the peer-cluster relation.
These plugins should be managed by a separate entity, named the relation manager. Defining a common structure for the relation manager is outside of the scope of this PR.
For example, repository-s3 and OpenSearch backup.
New Plugin Manager Infra
Now, the plugin manager is able to manage plugins that depend on config options, API calls and secrets. Whenever adding a new plugin, we should consider:
opensearch_plugins.py: this plugin should have a representation that is consumable by plugin_manager; it should be composed of all the configurations and keys to be added to or removed from the cluster's main configuration
opensearch_plugin_manager.py: add the new plugin to the plugin dict; the manager must be able to instantiate this new plugin
opensearch_{plugin-name}.py: if the plugin is managed by a given relation, this lib will implement the relation manager and interface with OpenSearch's plugin-specific APIs
models.py: add any relation data model to this lib
Using the new plugin data provider
While specific classes take care of the plugin's API calls (e.g. the /_snapshot API for the backup plugin is handled by the OpenSearchBackup class), the data provider facilitates the exchange of relation data between the specific class and the plugin_manager itself. This way, the plugin manager can apply any cluster-wide configuration that is needed for that plugin.
We need a class to deal with relation specifics, as some plugins may expect different relations depending on their deployment description, e.g. OpenSearchBackupPlugin. The OpenSearchPluginDataProvider encapsulates that logic away from the main plugin classes.
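The provider split described above can be sketched as follows. The class names and databag shapes are assumptions for illustration: the point is only that the plugin asks one provider interface for its relation data, while the provider decides whether that data comes from a dedicated relation or from the peer-cluster relation.

```python
from abc import ABC, abstractmethod


class PluginDataProvider(ABC):
    """Decides which relation databag a plugin should read from."""

    @abstractmethod
    def relation_data(self) -> dict:
        ...


class SmallDeploymentProvider(PluginDataProvider):
    """Reads the dedicated relation (e.g. the s3 relation) directly."""

    def __init__(self, databag: dict):
        self._databag = databag

    def relation_data(self) -> dict:
        return self._databag


class LargeDeploymentProvider(PluginDataProvider):
    """Non-orchestrator clusters receive the same data via peer-cluster."""

    def __init__(self, peer_cluster_databag: dict):
        self._databag = peer_cluster_databag

    def relation_data(self) -> dict:
        # Hypothetical nesting: the peer-cluster databag carries the s3 payload.
        return self._databag.get("s3", {})
```

The plugin itself is unchanged in either deployment type; only the provider passed to it differs.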
Secret Management
Each plugin that handles specific secrets must implement the secret management logic in its own operation. The goal is to avoid filling the opensearch_secrets.py methods with ifs for each plugin case, and to separate and isolate each plugin's code.
Remove unneeded restarts and add caching
We ensure that any configuration changes that come from plugin management are applied via API before being persisted in config files. If the API responds with a 200 status, then we only write the new value to the configuration and finish without needing a restart.
In case the service is down and the API is not available, we can assume we will eventually start the service back up. In this case, it suffices to write the config entries to the files and leave it to the next start to pick them up.
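The two paragraphs above can be sketched as a single helper. `put_cluster_settings` and `write_config` are hypothetical stand-ins for the real charm helpers (the actual code would issue a PUT to /_cluster/settings and write opensearch.yml); the behavior when the API is up but rejects the change is my assumption, not stated in the PR.

```python
def apply_setting(key, value, put_cluster_settings, write_config, api_available):
    """Apply one cluster setting; returns True if a restart is needed."""
    if api_available:
        status = put_cluster_settings({key: value})  # e.g. PUT /_cluster/settings
        if status == 200:
            write_config(key, value)  # persist the already-live value
            return False  # change is active: no restart needed
        return True  # assumption: API rejected the change, fall back to restart
    # Service down: write the file and let the next start pick it up.
    write_config(key, value)
    return False
```

Both happy paths avoid a restart: either the API made the change live immediately, or the next (inevitable) start will load it from the file.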
This task is going to be divided into three parts:
1. Address low-hanging fruit where we reduce the number of restarts and add caching support
2. Three main actions: (i) merge {add,delete}_plugin together with its equivalents in the OpenSearchPluginConfig class; (ii) receive one big dictionary where a key: None means we want to simply delete that entry; and (iii) the main OpenSearchKeystore must observe secret changes and update its values accordingly
3. Restore unit tests: these are going to be commented out whilst Parts 1 and 2 happen, given this part of the code was covered with extensive testing
The current implementation of plugin_manager.run waits for the cluster to be started before processing its config changes. We relax this requirement and allow for the case where the cluster is not yet ready, so we can modify the configuration without issuing a restart request.
#252 is closed with the OpenSearchPluginRelationsHandler interface. It allows plugins to define how they will handle their relation(s). The opensearch_backup module extends this interface and defines a checker to process either small or large deployment details.
Other relevant changes:
Closes #252, #280, #244