Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(telemetry): Run telemetry outside of kedro projects #775

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 26 additions & 22 deletions kedro-telemetry/kedro_telemetry/plugin.py
Original file line number Diff line number Diff line change
Expand Up @@ -168,25 +168,22 @@ def before_command_run(
):
"""Hook implementation to send command run data to Heap"""

if not project_metadata: # in package mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we also track packaged mode now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that line isn't related to package mode:
#729 (comment)

As I understand it, in package mode, execution will occur without the CLI. Am I correct?

Copy link
Member

@merelcht merelcht Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A packaged project will use the __main__.py as entrypoint. It will then look for the run method to execute.

I think you're right this line here in telemetry actually isn't related to package mode.. It seems to only be reached if you're not in a kedro project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packaged mode will execute the CLI through the entrypoint. I believe the commend about package mode because it was assumed that it is the only situation that project_metadata will be missing. This is not exactly true now because it would be either:

  1. packaged mode
  2. kedro command outside of project (i.e. kedro new)

I believe bot situations should be handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @noklam. Could you please provide more details about packaged mode? From what I see in the code, if I package a Kedro project, it will have an entry point project_name.__main__:main. When I execute project_name in the terminal, the kedro/framework/cli/project.py:run() function will be executed directly, without calling the before_command_run() hook. This is because that hook is only called in the KedroCLI() class during Kedro CLI execution. I'm trying to understand what we need to handle in the telemetry hooks code concerning packaging mode.

return
project_path = project_metadata.project_path if project_metadata else None

self._consent = _check_for_telemetry_consent(project_metadata.project_path)
self._consent = _check_for_telemetry_consent(project_path)
if not self._consent:
return

# get KedroCLI and its structure from actual project root
cli = KedroCLI(project_path=project_metadata.project_path)
cli = KedroCLI(project_path=project_path if project_path else Path.cwd())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my question is, does that project_path here actually matters? Where is it used as the command_args are coming from hook and KedroCLI which was created in kedro framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also trying to understand why KedroCLI is dependent on a specific Kedro project. I believe the CLI structure should be the same for all projects. Do you think we can just use Path.cwd() here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the project_path matters but we should add a comment to explain this. I have questions about the _recurse_cli used by the function, does it takes care of the LazyCommand we added lately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have questions about the _recurse_cli used by the function, does it takes care of the LazyCommand we added lately?

I checked the Lazy Loading PR, looks like _recurse_cli() func should work as previously. Also I tested current PR manually with latest kedro main branch on fresh virtual env, everything works well from the 1st command.

Copy link
Contributor

@noklam noklam Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig into this a bit and I think project_path is not an arbitrary value here and it matters.

The output of recurse_cli change when I provide different value of project_path (mlflow etc is missing), they are not important here but I am not sure if this will trigger the imports.
image

When project_path is set as root correctly:
image

I am also concerned would this revert the effort of LazyCommand since this seems to try to load the CLI group (hopefully not the dependencies). How could we test this properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Nok is right, the recursion through the CLI definitely structure triggers the loading of lazy the CLI command groups -
It uses get_command() :

cli_element.get_command(ctx, command_name),

Which if you look at the implementation of the LazyGroup, triggers the lazy load:
https://github.com/kedro-org/kedro/blob/8bbbfb6c7257005bef841143690308f05d4b829b/kedro/framework/cli/utils.py#L504-L509

I believe we need to find a way to do the masking without loading relying on KedroCLI to not defeat the purpose of introducing lazy loading

cli_struct = _get_cli_structure(cli_obj=cli, get_help=False)
masked_command_args = _mask_kedro_cli(
cli_struct=cli_struct, command_args=command_args
)

self._user_uuid = _get_or_create_uuid()

event_properties = _get_project_properties(
self._user_uuid, project_metadata.project_path / PYPROJECT_CONFIG_NAME
)
event_properties = _get_project_properties(self._user_uuid, project_path)
event_properties["command"] = (
f"kedro {' '.join(masked_command_args)}" if masked_command_args else "kedro"
)
Expand Down Expand Up @@ -221,7 +218,7 @@ def after_catalog_created(self, catalog):

if not self._event_properties:
self._event_properties = _get_project_properties(
self._user_uuid, self._project_path / PYPROJECT_CONFIG_NAME
self._user_uuid, self._project_path
)

project_properties = _format_project_statistics_data(
Expand Down Expand Up @@ -265,12 +262,16 @@ def _is_known_ci_env(known_ci_env_var_keys: set[str]):
return any(os.getenv(key) for key in known_ci_env_var_keys)


def _get_project_properties(user_uuid: str, pyproject_path: Path) -> dict:
project_id = _get_or_create_project_id(pyproject_path)
package_name = PACKAGE_NAME or UNDEFINED_PACKAGE_NAME
hashed_project_id = (
_hash(f"{project_id}{package_name}") if project_id is not None else None
)
def _get_project_properties(user_uuid: str, project_path: Path | None) -> dict:
if project_path:
pyproject_path = project_path / PYPROJECT_CONFIG_NAME
Comment on lines +266 to +267
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overwriting the supplied project_path here I'd just create a new variable to make it easier to follow what happens and debug if something is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's already a new variable?

project_id = _get_or_create_project_id(pyproject_path)
package_name = PACKAGE_NAME or UNDEFINED_PACKAGE_NAME
hashed_project_id = (
_hash(f"{project_id}{package_name}") if project_id is not None else None
)
else:
hashed_project_id = None

properties = {
"username": user_uuid,
Expand All @@ -282,7 +283,8 @@ def _get_project_properties(user_uuid: str, pyproject_path: Path) -> dict:
"is_ci_env": _is_known_ci_env(KNOWN_CI_ENV_VAR_KEYS),
}

properties = _add_tool_properties(properties, pyproject_path)
if project_path:
properties = _add_tool_properties(properties, pyproject_path)

return properties

Expand Down Expand Up @@ -344,21 +346,23 @@ def _send_heap_event(
)


def _check_for_telemetry_consent(project_path: Path) -> bool:
def _check_for_telemetry_consent(project_path: Path | None) -> bool:
"""
Use telemetry consent from ".telemetry" file if it exists and has a valid format.
Telemetry is considered as opt-in otherwise.
"""
telemetry_file_path = project_path / ".telemetry"

for env_var in _SKIP_TELEMETRY_ENV_VAR_KEYS:
if os.environ.get(env_var):
return False
if telemetry_file_path.exists():
with open(telemetry_file_path, encoding="utf-8") as telemetry_file:
telemetry = yaml.safe_load(telemetry_file)
if _is_valid_syntax(telemetry):
return telemetry["consent"]

if project_path:
telemetry_file_path = project_path / ".telemetry"
if telemetry_file_path.exists():
with open(telemetry_file_path, encoding="utf-8") as telemetry_file:
telemetry = yaml.safe_load(telemetry_file)
if _is_valid_syntax(telemetry):
return telemetry["consent"]
return True


Expand Down