Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

artifacts: Review naming restrictions #9821

Closed
daavoo opened this issue Aug 8, 2023 · 30 comments · Fixed by iterative/gto#427
Closed

artifacts: Review naming restrictions #9821

daavoo opened this issue Aug 8, 2023 · 30 comments · Fixed by iterative/gto#427
Assignees
Labels
A: artifacts Related to `artifacts` section in `dvc.yaml` p2-medium Medium priority, should be done, but less important

Comments

@daavoo
Copy link
Contributor

daavoo commented Aug 8, 2023

The current name restriction seems too limiting (and the reasons for the current restrictions are not clear cc @aguschin ) :

NAME = r"[a-z0-9]([a-z0-9-/]*[a-z0-9])?"

It has come up in different places:

https://iterativeai.slack.com/archives/CB41NAL8H/p1691411376466919
https://github.com/iterative/studio/issues/6897#issuecomment-1670034865

I think we should first clarify the reasons for the restrictions and then decide on allowing additional cases


Edit: go to #9821 (comment)

@daavoo daavoo added discussion requires active participation to reach a conclusion A: artifacts Related to `artifacts` section in `dvc.yaml` labels Aug 8, 2023
@dberenbaum
Copy link
Collaborator

@aguschin @omesser Do you have any more background on why we choose these limitations for artifact names? I recall trying to be compatible with git tags, but we are being more restrictive now than what git tags allow (no uppercase letters, _, ., etc.)

@omesser
Copy link
Contributor

omesser commented Aug 8, 2023

Couldn't find specific reasons for _ for . actually.
I'm guessing it's a restriction in the artifacts section in dvc.yaml (where these might be keys 😄 ).
@aguschin - had a hard time digging for the reason in GTO issue history, can you take a look please and tell us what prevented underscores, dots and capital letters?

@aguschin
Copy link
Contributor

aguschin commented Aug 9, 2023

That limitation was chosen just for simplicity: better to have it limited and lift it later than to allow everything and decide that we need to remove some characters.

So technically I think we can allow extra characters that don't interfere with gto special signs (@ and # and !) and with git tags limitations. The only other reason I can think of that worth considering is readability.

@dberenbaum
Copy link
Collaborator

@aguschin Are there any restrictions within gto now (besides git tag limitations), or is it only enforced at the dvc level?

@shcheklein
Copy link
Member

@dberenbaum it's restricted on the GTO side AFAIK.

@omesser
Copy link
Contributor

omesser commented Aug 11, 2023

We can change things on GTO side as well IMO if there's no deep reason not to (cc @aguschin)

@dberenbaum dberenbaum self-assigned this Sep 8, 2023
dberenbaum added a commit to iterative/gto that referenced this issue Sep 8, 2023
~Having an issue with tests, but opening this for discussion in the
meantime.~

The PR adds support for uppercase letters and `_`. I decided against
support for `.` because there are some restrictions around how it can be
used in [git refs](https://git-scm.com/docs/git-check-ref-format).

Related: iterative/dvc#9821

TODO:
- [x] Tests
- [ ] DVC PR
- [ ] DVCLive PR to improve
[warning](https://github.com/iterative/dvclive/blob/95000f45638d5126b7d76833050f549167a7449d/src/dvclive/live.py#L462-L466)
- [ ] Docs PR
@dberenbaum
Copy link
Collaborator

Waiting on #9770 so we can use gto as a dep here and not duplicate the regex and other logic in dvc

@tapadipti
Copy link
Contributor

tapadipti commented Oct 4, 2023

Next steps (Updated after further discussions in following comments):

  • Immediate fixes:
  • Later:
    • Remove all naming restrictions (except for those that result in an invalid yaml). Do this in all products (DVC, Studio, GTO). Update validation rules and error messages accordingly.
    • In GTO, before creating Git tag for an action, check if a Git tag with the same name (in any casing) exists. If yes, fail the action.
    • In DVC, disallow having two artifacts in the same dvc.yaml with names that are only different in casing.
    • Make it clear in the UI and docs that models/dirs/stages with the same name but different casing cannot exist in the repo.
    • In GTO, when creating Git tags for model actions, escape characters that aren’t allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.
    • In GTO, for all model actions, run the check on a tag name before creating the tag.
Context

Discussions related to this issue are spread out in multiple GH issues and Slack threads. I’ve tried to summarize them below. Do the next steps make sense?

@dberenbaum
Copy link
Collaborator

  • Allow uppercase letters in model names in Studio and GTO

There is a PR for this here

  • In GTO, when creating/working with Git tags for model actions, convert uppercase letters to lowercase

The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.

  • One item suggested by @shcheklein is “make DVC fail - so that people can’t create (annotate) model with names that we don’t support in GTO”. Do we need this if we remove all restrictions?

No, I think we agreed here that it won't be needed in that case.

@shcheklein
Copy link
Member

The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.

I think if we agree that we escape symbols, I would not do any other transformations. For upper case symbols -if GH accepts is - let's push it. WDYT?

@dberenbaum
Copy link
Collaborator

I see that GH accepts it (see gto-tag and GTO-tag in https://github.com/iterative/lstm_seq2seq/tags), but would have to check BB/GL. More importantly, I'm finding issues in git cli. Sometimes, it seems to accept a new tag with different casing, other times I get an error that the tag already exists:

$ git tag gto-tag
$ git tag GTO-tag
fatal: tag 'GTO-tag' already exists
$ git tag -l
gto-tag

I'm not sure why it's happening, but feels like this is too fragile to depend on.

@pmrowla
Copy link
Contributor

pmrowla commented Oct 5, 2023

I'm not sure why it's happening, but feels like this is too fragile to depend on.

It's because you are on a platform with case-insensitive filesystems (windows or macos). Git ref names are technically case sensitive, and this would work as expected on linux (where the filesystems are case sensitive). But on windows/mac, this may or may not fail, depending on the state of your git repo.

(It fails if git tries to create the files .git/refs/tags/gto-tag, .git/refs/tags/GTO-tag, since on windows/mac both of these filenames point to the same file. However, it can actually succeed on windows/mac in the event that one of those tags is stored in a packed ref instead of in the .git/refs/tags/... file)


This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.

@dberenbaum
Copy link
Collaborator

This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.

My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right? It is pretty unlikely, so maybe just warning about this or catching the exception is enough.

@dberenbaum
Copy link
Collaborator

Let's say I have a tag like My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkout My-Tag, will it fail?

@pmrowla
Copy link
Contributor

pmrowla commented Oct 5, 2023

My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right?

Yes, this is correct (note that "may fail" is significant since it depends on whether or not the existing tag ref is packed)

Let's say I have a tag like My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkout My-Tag, will it fail?

No, git is case sensitive. Git can tell the difference between a tag named My-Tag and my-tag, and considers them to be two different tags. On a linux machine, the git client will always behave as expected.

The issue is specifically with what happens on windows/mac. In the event that you have an unpacked/loose ref file named .git/tags/My-Tag, you cannot create a new tag named .git/tags/my-tag (since windows/mac filesystems will report that the file already exists). Git tag lookups are still case sensitive though, so in the case where you have both a set of packed refs containing the tag My-Tag, and a loose ref file named .git/tags/my-tag, Git will checkout the correct one if you do git checkout My-Tag.

@pmrowla
Copy link
Contributor

pmrowla commented Oct 5, 2023

Because of this weirdness, simplified best-practice for handling git refs in a cross-platform project is to just always use lowercase for branch/tag names.

However, since GTO tags can include actual directory paths, if you wanted to enforce a lowercase-only rule for GTO tags, you would have to extend the rule to require that users only store artifacts in directories with lowercase paths (unless the user only cares about windows/mac).

Given a GTO tag like path/to:artifact#v1.0.0, and an artifact entry stored in Path/To/dvc.yaml:

  • This will work on windows/mac since open('path/to/dvc.yaml') will succeed (path/to/ and Path/To/ are the same dir)
  • This will not work on linux since open('path/to/dvc.yaml') will fail saying the file does not exist (path/to/ and Path/To/ are not the same dir)

Also keep in mind that this gets even worse taking into consideration that directory paths can contain non-ascii characters (i.e. latin characters with accents, Cyrillic, CJK, etc) , and how git stores those characters is also dependent on platform + system encoding settings.

i.e. given something like café/dvc.yaml, on windows/mac/linux git may or may not encode café differently in all 3 cases.

@tapadipti
Copy link
Contributor

Considering the above comments, would this be better?

  • do not change the casing
  • before creating Git tag for an action,
    • check if a Git tag with the same name (in any casing) exists
    • if yes, fail the action (coz the Git tag creation might fail depending on the user's platform)
      This is basically a restriction that models/dirs/stages with the same name but different casing cannot not exist in the repo. We should make this restriction clear in the UI and docs.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Oct 5, 2023

I think that makes sense @tapadipti. Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?

Edit: also, in case it's unclear, this seems like an unlikely scenario, so I don't think it should block us moving forward with allowing uppercase.

@tapadipti
Copy link
Contributor

Discussed this with @dberenbaum a short while ago.

Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?

  • We should implement this change in dvc.
  • It's good to still add the check in gto also. (For the rare case when 2 dvc.yaml files have the same path but different casing, and contain models with the same name.)

I don't think it should block us moving forward with allowing uppercase.

Allowing uppercase is already done.

@tapadipti
Copy link
Contributor

I've updated the next steps in this comment. ptal.

@dberenbaum
Copy link
Collaborator

Let's close this one and track there.

@dberenbaum
Copy link
Collaborator

Can we do this at the same time we remove all naming restrictions? Otherwise, I think it could introduce new errors and be a breaking change when we implement escape characters.

Also, we only need to escape characters disallowed in Git refs, but it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git refs.

@tapadipti
Copy link
Contributor

tapadipti commented Oct 6, 2023

Let's close this one and track there.

The comment with tasks is in this issue itself. So I'm re-opening the issue.

In GTO, when creating Git tags for model actions, escape characters that aren’t allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.

Can we do this at the same time we remove all naming restrictions?

👍

it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git ref

Agreed. I'll search for what library to use. If you have any suggestions, pls share.

@tapadipti tapadipti reopened this Oct 6, 2023
@dberenbaum
Copy link
Collaborator

@shcheklein mentioned URL escaping, and we could use urllib.parse.quote for that.

@tapadipti
Copy link
Contributor

tapadipti commented Oct 10, 2023

@shcheklein You mentioned that GTO "created and pushed the tag before actually running a check on the tag name". I'm looking into the code and I can see that the assert_fullname_is_valid() function is called on the full model name before create_tag() is called. Can you pls share which check you were referring to?

UPDATE: Discussed this with @shcheklein. I tried the register action. The issue could be in some other actions (eg, deprecate).

dberenbaum added a commit to iterative/gto that referenced this issue Oct 10, 2023
Related to iterative/dvc#9821

---------

Co-authored-by: Dave Berenbaum <dave.berenbaum@gmail.com>
@jellebouwman
Copy link

jellebouwman commented Oct 12, 2023

Hello! GTO got upgraded, DVC is using the latest GTO version, and Studio BE depends on the latest DVC version. https://github.com/iterative/studio/issues/7903 incorporates the new validation rules when adding models on the Studio FE. FE changes will be released today.

@dberenbaum
Copy link
Collaborator

@tapadipti Do you plan to start on the "Later" items in #9821 (comment)? Is there anything else needed before that?

@tapadipti
Copy link
Contributor

Do you plan to start on the "Later" items in #9821 (comment)? Is there anything else needed before that?

I started working on one of the items in GTO (here), and will gradually pick the others. But probably not the DVC/Studio tasks.

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed discussion requires active participation to reach a conclusion labels Oct 25, 2023
tapadipti added a commit to iterative/gto that referenced this issue Oct 26, 2023
Earlier, the model name was not being validated for the deprecate model
action. It was getting validated for all other actions including the
unassign and deregister actions (which are variations of the deprecate
action).

Resolves
iterative/dvc#9821 (comment)
@tapadipti tapadipti reopened this Oct 26, 2023
@dberenbaum
Copy link
Collaborator

Is this still p1? @tapadipti is it still in your plans?

@dberenbaum
Copy link
Collaborator

Lowering the priority of the remaining items

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Dec 12, 2023
@dberenbaum dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: artifacts Related to `artifacts` section in `dvc.yaml` p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants