Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate how we can remove Kedro's dependency on git (for starters) #2051

Closed
jmholzer opened this issue Nov 23, 2022 · 16 comments
Closed

Comments

@jmholzer
Copy link
Contributor

jmholzer commented Nov 23, 2022

Description

Currently, Kedro has a (subtle) dependency on git; it is required to create a new project when using one of the default starters. This is because Kedro imports and uses gitpython, which requires that git is installed on the system.

Context

@stichbury documented that git is a requirement for installing Kedro as part of #1989, @yetudada pointed out that it is not a good idea to require users who may not be familiar with git to install the executable as part of the set-up process for Kedro.

Possible Implementation

Instead of using gitpython (and therefore the git executable) to clone the git repositories, we can use a different protocol, perhaps http via the requests library.

@jmholzer jmholzer added the Issue: Feature Request New feature or improvement to existing feature label Nov 23, 2022
@datajoely
Copy link
Contributor

Cookiecutter exposes a git clone functionality using this API so you can clone starters without using Git. IIRC the only other point we need this is recording the hash in exp tracking.

@idanov idanov added this to the Databricks milestone Jan 27, 2023
@merelcht
Copy link
Member

Related to #1451

@astrojuanlu
Copy link
Member

Has https://www.pygit2.org/ (the Python bindings of https://libgit2.org/) been evaluated?

@merelcht merelcht removed the Issue: Feature Request New feature or improvement to existing feature label Feb 20, 2023
@antonymilne
Copy link
Contributor

tl;dr: maybe I'm missing something but I'm pretty confused by this ticket and don't really understand what (if anything) we should do here. Fully agree that kedro shouldn't require use of git, but at the moment it 99.9% doesn't already, and it's not clear to me whether/how we should solve that 0.1%.

The cookiecutter.vcs.clone function that @datajoely mentions above still does this under the hood with repo_type = "git":

subprocess.check_output(  # nosec
                [repo_type, 'clone', repo_url],
                cwd=clone_to_dir,
                stderr=subprocess.STDOUT,
            )

... and so switching to using this won't help here.

AFAIK we actually don't currently have a hard dependency on git anywhere - the only things that won't work are:

  • the hash in experiment tracking - seems fine to just not have this functionality at all in the case you don't have git, which I believe is what happens now
  • kedro new with a starter on a git repo, which includes built-in starters like pandas-iris etc. It might be nice if this worked for people, but I've never actually heard anyone complain that it didn't work for them so I don't think it can be at all common. At a glance of the cookiecutter source I think a user without git could easily work around it by just entering the address of the URL to the zip file rather than to the repo anyway (see cookiecutter.determine_repo_dir)

As far as I can see the only place where we actually import gitpython is in starters._get_available_tags (with comment Not at top level so that kedro CLI works without a working git executable.). We could try and replace that bit with something that doesn't need git so we could remove gitpython as a dependency but it's not exactly a key feature so not sure how much effort should be put into it - maybe even preferable to remove the feature at all instead of trying to come up with a generic non-git solution that does what the current code does. Personally I've never heard anyone complain about this function not working (i.e. number of people without git installed who use this feature must be very small), so I'd just leave it as it is.

@stichbury
Copy link
Contributor

Just to add here that -- should this be resolved and the dependency is removed -- we should update the docs to remove the requirement for git as a prerequisite.

I don't know how many are affected tbh @AntonyMilneQB. I did a full walkthrough of setup and spaceflights on my old windows laptop and it didn't have git installed, so it caught me out. Maybe there will be quite a few of our McK users likewise? We'd like to take out the "accidental complexity" of asking people to install git from the docs, but it's clearly not a high priority to fix a framework ticket just to tidy up the docs a bit!

But should anyone get round to resolving this issue, please let me know, or drop by the Get Started docs and remove the text.

Would this maybe make a suitable community ticket perhaps @astrojuanlu ?

@astrojuanlu
Copy link
Member

I think this would be a suitable community ticket as long as we have a clear consensus on how to move forward, which is not entirely clear to me (but please correct me if I'm wrong).

@merelcht
Copy link
Member

The git dependency comes from cookiecutter. The only way around seems to remove cookiecutter, so we'll need to discuss this in technical design.

It's also unclear at the moment what breaks if you don't have git installed.

  • kedro new --starter=.. with a Kedro starter will break
  • Anything else major?

@merelcht merelcht changed the title Remove Kedro's dependency on git Investigate how we can remove Kedro's dependency on git (for starters) Mar 20, 2023
@merelcht merelcht removed this from the Make Kedro work on Databricks milestone Mar 20, 2023
@stichbury
Copy link
Contributor

Following discussion in planning 20/03/2023

What we're currently saying in the docs about git (following my recent updates to onboarding docs): https://docs.kedro.org/en/latest/get_started/install.html#installation-prerequisites

git: You must install git onto your machine if you do not already have it. Type git -v into your terminal window to confirm it is installed; it will return the version of git available or an error message. You can download git from the official website.

We could say

git: To build Kedro projects pre-populated with example code, you must install git onto your machine if you do not already have it. Type git -v into your terminal window to confirm it is installed; it will return the version of git available or an error message. You can download git from the official website.

However, I think a better approach is, as @AntonyMilneQB suggests, to take git out of the prerequisites list and instead highlight that users need git in the places where we onboard them using a starter:

I think we should do it in each place because there are different paths to these pages and not everyone will work through sequentially. I can do this, but not today.

@astrojuanlu
Copy link
Member

astrojuanlu commented Mar 21, 2023

To note, there was 1 suggestion to cookiecutter in 2016 to replace the git binary by pygit2, however it was rejected because back then precompiled binary wheels were not as widespread and therefore installing libgit2 was a bit of a challenge: cookiecutter/cookiecutter#846

This shouldn't be a problem nowadays, see the amount of wheels: https://pypi.org/project/pygit2/1.11.1/#files

However, cookiecutter is not maintained anymore these days edit: the issue was closed in June as "won't fix", so it's highly unlikely that this change is ever going to happen.

Just in case, I opened an issue in Copier about this: copier-org/copier#1045

@antonymilne
Copy link
Contributor

Fully agree with this @stichbury, thanks for writing it up.

Just for completeness, let me add my other comments here too. If someone still wants to do kedro new --starter=pandas-iris without installing git then it’s already 99% the same to do kedro new -s https://github.com/kedro-org/kedro-starters/archive/0.18.6.zip --directory=pandas-iris. It might be worth mentioning this in the docs as an alternative if you don't want to install git. If there's lots of people who end up trying to do this we could try to make it easier so that just the same kedro new --starter=pandas-iris command would work without git.

But honestly I haven’t heard anyone ask for that before, so I don’t think there would be much call for it. At the moment I think the main problem is just that the docs make it sounds like git is a core dependency when it’s not actually.

@AhdraMeraliQB
Copy link
Contributor

Related: #1826 (comment)

@merelcht merelcht added this to the Starter improvements milestone May 9, 2023
@astrojuanlu
Copy link
Member

Dumping some ideas on how migrating away from cookiecutter could potentially help us on various fronts.

If we were to choose copier it could help with:

Another alternative, cruft, inherits some flaws from cookiecutter (in particular, it can't initialize a template in the current directory) 👎🏽 but on the flip side is compatible with cookiecutter templates (which copier isn't) 👍🏽, also has update capabilities 👍🏽 and has a much nicer CLI (probably based on rich) 👍🏽

@AhdraMeraliQB
Copy link
Contributor

CC'ed for opinions: @merelcht, @astrojuanlu, @stichbury

Discussed in backlog grooming today. @deepyaman, @lrcouto, @noklam, @SajidAlamQB and myself were in favour of closing this PR due to the dependencies of the revamped kedro new flow.

The git dependency comes from cookiecutter. The only way around seems to remove cookiecutter, so we'll need to discuss this in technical design.

Given the recent work using cookiecutter and the kedro-starters repo to populate a new project with example code (when selected by the user), moving away from cookiecutter, and therefore git, would require some severe reworking of the current implementation.

However, I think a better approach is, as @AntonyMilneQB suggests, to take git out of the prerequisites list and instead highlight that users need git in the places where we onboard them using a starter:

Given that kedro new now requires access to kedro-starters even when not using a starter, I think it would make sense to keep git as a prerequisite - it's not just kedro new--starter=<starter> but now also kedro new --example=yes that will require this.

If someone still wants to do kedro new --starter=pandas-iris without installing git then it’s already 99% the same to do kedro new -s https://github.com/kedro-org/kedro-starters/archive/0.18.6.zip --directory=pandas-iris. It might be worth mentioning this in the docs as an alternative if you don't want to install git.

@deepyaman did mention a use case where users may not have access to git, and as such, this workaround would be good to highlight. I support this suggestion to add this to the starters documentation.

@stichbury
Copy link
Contributor

Supported. Please could you make any tickets necessary for updating documentation e.g. where it may not yet be clear that git is a pre-req (we already have it in our docs for the "set up Kedro" section) and any other docs changes you think needed. I'm not sure if adding the workaround for not having git makes sense in the place you've linked it (in starters.md) or maybe in a FAQ -- I think I'd need to see the copy before confirming.

@AhdraMeraliQB
Copy link
Contributor

Closing in favour on #3571

@astrojuanlu
Copy link
Member

Talking to @lrcouto about #3900, we discussed that the way we interact with git repos through cookiecutter might be a tad problematic. Awaiting for her more detailed account of what's happening also in the context of #3884.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

8 participants