Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite _describe_git in KedroSession to use python based implementation of the git commands #1451

Closed
1 task
noklam opened this issue Apr 20, 2022 · 5 comments

Comments

@noklam
Copy link
Contributor

noklam commented Apr 20, 2022

https://github.com/FriendCode/gittle seems to be the easiest option, although it hasn't been maintained for years and it's probably not a good idea to rely on a seemingly abandoned project. Another option is to implement the two functions we need ourselves as per this example here: https://wyag.thb.lt/#orgd8ab0ff (which is more fun, but probably harder, especially for the git status check).

Background

Traditionally we have been using git within kedro run to track metadata about the working directory. The current way we do is using subprocess which calls the git executable directly. However, it causes some issues when we are in some read-only / Databricks's dbfs / spark-submit job.

The idea is to use native python calls to eliminate the need for an external process.

Todo:

  • Find alternatives (open-source library or re-implement the 2 functions we need)
    Possible solutions

Use Case

  • _describe_git
    def _describe_git(project_path: Path) -> Dict[str, Dict[str, Any]]:
    project_path = str(project_path)
    try:
    res = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"],
    cwd=project_path,
    stderr=subprocess.STDOUT,
    )
    git_data = {"commit_sha": res.decode().strip()} # type: Dict[str, Any]
    git_status_res = subprocess.check_output(
    ["git", "status", "--short"],
    cwd=project_path,
    stderr=subprocess.STDOUT,
    )
    git_data["dirty"] = bool(git_status_res.decode().strip())
    # `subprocess.check_output()` raises `NotADirectoryError` on Windows
    except (subprocess.CalledProcessError, FileNotFoundError, NotADirectoryError):
    logging.getLogger(__name__).warning("Unable to git describe %s", project_path)
    return {}
    return {"git": git_data}
    • ["git", "rev-parse", "--short", "HEAD"]
    • ["git", "status", "--short"],
@idanov
Copy link
Member

idanov commented Apr 20, 2022

Unfortunately gitpython will not fit the bill as it needs the git executable, which is exactly what we want to avoid depending on.

@avan-sh
Copy link
Contributor

avan-sh commented Apr 22, 2022

I'm not sure if this error would still pop up after this change to the function. The fatal error was also in directories without .git as fixed by that change.

If I remember right, the error on read-only projects was also related to not being being able to access info.log.

@noklam noklam removed their assignment May 9, 2022
@antonymilne
Copy link
Contributor

antonymilne commented May 25, 2022

As per @avan-sh, now we catch this error I think changing this is not as high priority as it was before. I guess we still want to do it, just it's not essential to get things working on databricks any more.

FYI @avan-sh the thing about not being able to write to log files on a real-only file system is also fixed now and will be release in 0.18.2.

@avan-sh
Copy link
Contributor

avan-sh commented May 26, 2022

@AntonyMilneQB while I understand this is not a priority anymore, I don't get why this is still needed. It might be that using subprocesses is not be ideal.

@merelcht
Copy link
Member

Closing in favour of #2051

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants