Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike: Can we use telemetry to track user selection in the revamped "new" flow? #2522

Closed
amandakys opened this issue Apr 17, 2023 · 9 comments
Assignees
Labels
Component: CLI Issue/PR that addresses the CLI for Kedro

Comments

@amandakys
Copy link

amandakys commented Apr 17, 2023

Description

With the new utility modules and template flow for Kedro (#2388) users will be able to choose to add utilities, specific templates and examples. We would like to be able to track which options users select.

The aim of this spike is to figure out how we can track this.

To explore

  • Can we use kedro-telemetry? At the moment telemetry is only available after requirements.txt is installed. If we asked for permission to collect data when kedro is installed, when we could collect data about what options are used on project creation
  • What are the downside of adding kedro-telemetry as a dependency to Kedro? When a user uninstalls they're likely to get a pip error about missing dependencies. Anything else?
  • How would the user perception of Kedro change if kedro-telemetry is a dependency?
  • The opt-out flow has to be really robust and always work. e.g. in CI workflows.
@amandakys amandakys added the Issue: Feature Request New feature or improvement to existing feature label Apr 17, 2023
@merelcht
Copy link
Member

To explore:

  • What are the downside of adding kedro-telemetry as a dependency to Kedro? When a user uninstalls they're likely to get a pip error about missing dependencies. Anything else?
  • How would the user perception of Kedro change if kedro-telemetry is a dependency?
  • The opt-out flow has to be really robust and always work. e.g. in CI workflows.

@yetudada yetudada changed the title Consider making telemetry available at project creation Spike: Consider making telemetry available at project creation Sep 1, 2023
@yetudada yetudada added Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation and removed Issue: Feature Request New feature or improvement to existing feature labels Sep 1, 2023
@merelcht merelcht changed the title Spike: Consider making telemetry available at project creation Spike: Can we use telemetry to track user selection in the revamped "new" flow? Sep 1, 2023
@merelcht merelcht added Component: CLI Issue/PR that addresses the CLI for Kedro and removed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation labels Sep 1, 2023
@astrojuanlu
Copy link
Member

astrojuanlu commented Sep 1, 2023

I confirm that pip uninstall a mandatory kedro dependency does not emit even a warning:

❯ pip uninstall omegaconf                                                                                     (kedro311-dev) 
Found existing installation: omegaconf 2.3.0
Uninstalling omegaconf-2.3.0:
  Would remove:
    /Users/juan_cano/.micromamba/envs/kedro311-dev/lib/python3.11/site-packages/omegaconf-2.3.0.dist-info/*
    /Users/juan_cano/.micromamba/envs/kedro311-dev/lib/python3.11/site-packages/omegaconf/*
    /Users/juan_cano/.micromamba/envs/kedro311-dev/lib/python3.11/site-packages/pydevd_plugins/*
Proceed (Y/n)? y
  Successfully uninstalled omegaconf-2.3.0

Of course it gets flagged by pip check:

❯ pip check                                                                                                   (kedro311-dev) 
kedro 0.18.13 requires omegaconf, which is not installed.

And everytime the user upgrades Kedro, they would get that dependency installed again:

❯ pip install -U kedro                                                                                        (kedro311-dev) 
...
Requirement already satisfied: kedro in ./.micromamba/envs/kedro311-dev/lib/python3.11/site-packages (0.18.13)
...
Installing collected packages: omegaconf
Successfully installed omegaconf-2.3.0

We'd need to investigate how does this affect our conda package.

Despite pip's permissiveness, I still have a bad feeling about this. It's false that kedro-telemetry is required to run Kedro, so putting it in the [project.dependencies] feels wrong. At the same time, echoing what both @amandakys and @merelcht said above, "The opt-out flow has to be really robust and always work", otherwise we'll probably annoy users that are deploying Kedro on Docker images, AWS services, Debian packages, and other places we don't know.

As an alternative solution, if each add-on were a Kedro plugin that is pip-installed, we have PyPI download stats and we don't even need to use the telemetry flow for this.

@noklam
Copy link
Contributor

noklam commented Sep 1, 2023

I am also slightly uncomfortable by putting it in the core dependency, making sure it can run everywhere is proven to be tricky.

On the other hand, I want to point out that kedro new is not completely non-trackable. But I think it's maybe a niche case here because AFAIK it's only tracked if you run kedro new inside a kedro project already.

I created this chart in HEAP: https://heapanalytics.com/app/env/2388822444/graph/chart/CLI-Command-Runs-kedro-new-7690174

image

@astrojuanlu
Copy link
Member

astrojuanlu commented Sep 1, 2023

More crazy ideas: turn kedro into a metapackage depending on kedro-core + kedro-new, and kedro-new has a mandatory dependency on kedro-telemetry. So if people don't want to depend on telemetry, they can always pip install kedro-core instead, at the expense of not having the kedro new command.

Edit: This is a weird dependency tree, and tweaking it (like kedro depending on kedro-core + kedro-new + kedro-telemetry) still makes kedro-telemetry a mandatory dependency, which has the problems described above.

@ankatiyar
Copy link
Contributor

ankatiyar commented Oct 4, 2023

So, I looked through how Kedro and kedro-telemetry work together and created a miro board to make it easier to visualise.
Here's the link!

Background

Kedro hooks

Kedro has these two kinds of hooks that can be used by users/plugins -
image

Two of these are used by kedro-telemetry -> before_command_run and after_context_created

Kedro Telemetry

kedro-telemetry works in two parts -
image

  1. KedroTelemetryCLIHooks uses the before_command_run hook to send the information about commands run by the user to Heap.

At this point, I spotted two potential bugs/problems -

  • 🐞 The way masking works in telemetry is, it imports KedroCLI from the kedro package. Uses that to build a vocabulary of sorts. So when it receives a command that triggered this hook -> ['run', '--pipeline', 'pipe1'], it goes through the list and masks things that are not in the dictionary eg. ['run', '--pipeline', '*****']. So if it receives ['run', '--pipeline', 'airflow'], this is not masked properly because "airflow" is a valid kedro command (kedro airflow create with the kedro-airflow plugin) and the word "airflow" is not masked.
  • 🐞 For some reason (I haven't looked into it yet), each command sends two different events to Heap ->
    Screenshot 2023-10-04 at 13 41 54 EDITED TO ADD: This is on purpose!
  1. KedroTelemetryProjectHooks uses after_context_created hook to send project stats like number of nodes, number of pipelines, number of datasets to Heap.

Proposal 1

Problems

  • kedro is a dependency of kedro-telemetry and in its current form, adding kedro-telemetry as a dependency of kedro creates a circular dependency.
  • kedro-telemetry is only installed AFTER 1) a project is created and 2) the user does pip install -r src/requirements within their projects

Solution

  • TLDR: Separate out kedro-telemetry into two packages -
  1. kedro-cli-telemetry - A plugin to track the usage of CLI commands.
    • Should be kedro agnostic, does not depend on Kedro at all.
    • Right now KedroTelemetryCLIHooks only uses Kedro to create a "vocabulary" from Kedro's CLI commands which is used for masking of CLI commands that are run. This can be replaced with a more generic masking algorithm that does not rely on Kedro.
    • Can be installed with Kedro as a dependency and be used to track all CLI commands from the get go.
    • Uses the before_command_run hook.
  2. kedro-project-telemetry - A plugin to track project statistics
    • Depends on Kedro, can be a requirement in the user's src/requirements.txt and be installed after the project is created.
    • Used to track project stats like no. of pipelines, nodes and datasets
    • (Current state of kedro-telemetry - KedroTelemetryCLIHooks)

Pros and Cons

Pros

  • This will address the general problem of the adoption of kedro-telemetry (i.e. It is only installed way down in the workflow)
  • Can track kedro new command usage, also --starters and which starters are used.

Cons

  • This can track the usage of add ons with --addons flag but not necessarily with add-ons selected via the TUI. AFAIK (through discussions with @AhdraMeraliQB and @SajidAlamQB), the selection of add ons through prompt is handled by Cookiecutter. Telemetry would still see just kedro new being used not which add ons were selected.
  • Too convoluted with multiple packages? As a user if I wanted to uninstall telemetry entirely, would I need to uninstall both packages?

Proposal 2

If we just want to track which add-ons are being selected, we could just keep everything as it is and just save what add ons have been selected by the project creator as metadata in the project's pyproject.toml

  • When a project is created either through TUI wizard or --addons flag, just save that to pyproject.toml under [tools.kedro]
[tool.kedro]
package_name = "spaceflights"
project_name = "spaceflights"
project_version = "0.18.11"
add_ons = ["1,2,3"] #or ["Linting", "Documentation"] or ["1", "2", "3"] format can be decided later
  • This info is accessible to before_command_run hooks implementation, simply pass it on to Heap

Pros and Cons

Pros

  • Simple, minimal change

Cons

  • Doesn't address the problem of kedro-telemetry's adoption?

Conclusions

  • I think we should proceed with Proposal 2 for tracking the add ons flow since it is minimal change
  • Address bugs in kedro-telemetry
  • Code in kedro-telemetry could use with some design and refactoring work

@astrojuanlu
Copy link
Member

astrojuanlu commented Oct 4, 2023

Thanks a lot @ankatiyar for the super detailed writeup ⭐

I'm very much in favour of Proposal 2 if it's technically feasible. I have a personal interest on revamping how we collect telemetry but there is no urgency for it, and Proposal 2 already addresses the original concern without adding much complexity.

@merelcht
Copy link
Member

merelcht commented Oct 5, 2023

This is fantastic work! Thanks @ankatiyar 🤩 I'm also in favour of proposal 2.

@yetudada
Copy link
Contributor

yetudada commented Oct 5, 2023

I also really, really like proposal 2! Let's do it and then revisit kedro-telemetry another time. This is fantastic work! 🚀

@ankatiyar
Copy link
Contributor

Thank you for the feedback! Closing this in favour of the follow up tickets -
#3129 for implementation on Kedro side
kedro-org/kedro-plugins#373 for implementation on kedro-telemetry side
kedro-org/kedro-plugins#371 for the masking bugfix
kedro-org/kedro-plugins#375 General improvement of kedro-telemetry workflow that we can revisit after 0.19.0. I'll write up stuff that came up while researching this there. (TODO)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: CLI Issue/PR that addresses the CLI for Kedro
Projects
Archived in project
Development

No branches or pull requests

6 participants