Fix broken NeMo dependencies #372

Merged Nov 15, 2024 (16 commits)

Conversation

sarahyurick
Collaborator

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@sarahyurick sarahyurick changed the title Add packaging module Fix broken NeMo dependencies Nov 14, 2024
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@sarahyurick sarahyurick requested a review from ko3n1g November 15, 2024 02:54
@sarahyurick sarahyurick mentioned this pull request Nov 15, 2024
3 tasks
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 15, 2024
Comment on lines 105 to 108
try:
    from nemo.collections.common.tokenizers import SentencePieceTokenizer
except (ImportError, ModuleNotFoundError):
    from .sentencepiece_tokenizer import SentencePieceTokenizer
Collaborator Author

@sarahyurick sarahyurick Nov 15, 2024

What do we think about this?

ModuleNotFoundError: No module named 'nemo'
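
For context on the fallback pattern being discussed: `ModuleNotFoundError` is a subclass of `ImportError`, so catching `ImportError` alone already covers the "No module named ..." case. Below is a minimal sketch of that pattern, not the code from this PR. The module name `optional_backend_xyz` is made up and assumed not to be installed, and the standard-library `json` module merely stands in for a vendored fallback.

```python
import importlib


def import_with_fallback(preferred: str, fallback: str):
    """Return the preferred module if it can be imported, else the fallback."""
    try:
        return importlib.import_module(preferred)
    except ImportError:
        # ModuleNotFoundError subclasses ImportError, so this branch also
        # handles "No module named ..." errors.
        return importlib.import_module(fallback)


# `optional_backend_xyz` is a hypothetical name used only for illustration.
module = import_with_fallback("optional_backend_xyz", "json")
print(module.__name__)
```

The trade-off of such a fallback is that two code paths must be kept in sync, which is part of why the thread moves toward dropping the NeMo import entirely.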

Collaborator

Based on our discussions on Slack, I think we can just rewrite this class as something like this:

import sentencepiece  # DocumentFilter is assumed to be imported already in this module


class TokenizerFertilityFilter(DocumentFilter):

    def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
        if path_to_tokenizer is None:
            raise ValueError("Must provide a valid path to a SentencePiece tokenizer")
        # Load the SentencePiece model directly instead of going through NeMo.
        self._tokenizer = sentencepiece.SentencePieceProcessor()
        self._tokenizer.Load(path_to_tokenizer)
        self._threshold = min_char_to_token_ratio

        self._name = "tokenizer_fertility"

    def score_document(self, source):
        # The score is the number of characters per token.
        tokens = self._tokenizer.encode_as_pieces(source)
        num_chars = len(source)
        num_tokens = len(tokens)
        if num_tokens == 0:
            return -1
        return num_chars / num_tokens

    def keep_document(self, score):
        return score >= self._threshold

Then we can just delete the one file you copied over. Let me know what you think.
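
To illustrate the scoring logic in the suggestion above without needing a trained SentencePiece model, here is a minimal sketch. `WhitespaceTokenizer` is a hypothetical stand-in that only mimics the `encode_as_pieces` method; it is not part of sentencepiece or NeMo Curator, and a real model would split text into subword pieces instead.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a SentencePiece processor (illustration only)."""

    def encode_as_pieces(self, text):
        return text.split()


def char_to_token_ratio(tokenizer, text):
    """Characters per token; -1 when the text produces no tokens."""
    tokens = tokenizer.encode_as_pieces(text)
    if not tokens:
        return -1
    return len(text) / len(tokens)


def keep(score, threshold=2.5):
    """Mirror keep_document: keep documents at or above the threshold."""
    return score >= threshold


tok = WhitespaceTokenizer()
score = char_to_token_ratio(tok, "hello world")  # 11 characters / 2 tokens
print(score, keep(score))
empty_score = char_to_token_ratio(tok, "")
print(empty_score, keep(empty_score))
```

Because an empty document scores -1, it always falls below any positive threshold and is filtered out.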

Collaborator

We should probably run this as a batch rather than piece by piece on each file, and return a single file. We could also use crossfit for it (if we want to).

Collaborator

@VibhuJawa VibhuJawa Nov 15, 2024

This is what that would look like:

    cf.op.Tokenizer(model, cols=["text"], tokenizer_type="sentencepiece")

That said, we may need to make sure this also works in a CPU environment, so there may be some complexity here that we need to work out.
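
One way to picture the batching idea, independent of crossfit, is to tokenize a list of documents in one call and compute all ratios together. The sketch below assumes a tokenizer callable that accepts a list of strings and returns one token list per string. `fake_batch_tokenize` is a hypothetical placeholder and does not reflect crossfit's or sentencepiece's actual API.

```python
from typing import Callable, List, Sequence


def fake_batch_tokenize(texts: Sequence[str]) -> List[List[str]]:
    """Hypothetical batch tokenizer: splits every text on whitespace."""
    return [text.split() for text in texts]


def batch_char_to_token_ratios(
    texts: Sequence[str],
    batch_tokenize: Callable[[Sequence[str]], List[List[str]]],
    batch_size: int = 2,
) -> List[float]:
    """Score documents batch by batch instead of one tokenizer call per document."""
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        for text, tokens in zip(batch, batch_tokenize(batch)):
            # Same convention as the filter: -1 when there are no tokens.
            scores.append(len(text) / len(tokens) if tokens else -1)
    return scores


docs = ["hello world", "", "a bc def"]
print(batch_char_to_token_ratios(docs, fake_batch_tokenize))
```

This pure-Python version runs on CPU only; whether a GPU-backed tokenizer op can share the same interface is the open question raised in the comment.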

Collaborator Author

Thanks @VibhuJawa! I have opened #377 to track this.

import torch


class SentencePieceTokenizer:
Collaborator Author

@@ -54,14 +55,14 @@ dependencies = [
    "lxml_html_clean",
    "mecab-python3",
    "mwparserfromhell==0.6.5",
    "nemo_toolkit[nlp]>=1.23.0",
Collaborator Author

Opened #376.

Collaborator

@ryantwolf ryantwolf left a comment

We probably also need to add torch as an explicit dependency now, since we previously inherited it from NeMo. I'm not sure whether the HF libraries pull it in automatically, though.
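
One way to check the question above locally is to inspect the metadata of an installed distribution. The sketch below uses the standard-library `importlib.metadata.requires`. The helper names are hypothetical, and the check is deliberately coarse: it does not distinguish hard requirements from optional extras, so a `True` result may only mean that torch appears under an extra.

```python
import re
from importlib import metadata
from typing import Optional


def requirement_name(requirement: str) -> str:
    """Extract the lowercased project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1).lower() if match else ""


def declares_dependency(distribution: str, dependency: str) -> Optional[bool]:
    """Return True/False if `distribution` is installed, or None if it is not."""
    try:
        requirements = metadata.requires(distribution) or []
    except metadata.PackageNotFoundError:
        return None
    return any(requirement_name(req) == dependency.lower() for req in requirements)


# The result depends on what is installed in the current environment.
print(declares_dependency("transformers", "torch"))
```

Declaring torch explicitly avoids relying on whatever another package happens to pull in.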

@sarahyurick
Collaborator Author

sarahyurick commented Nov 15, 2024

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@ryantwolf
Collaborator

@sarahyurick I think there's a place in the user guide under images/gettingstarted.rst that has the cython install instructions too.

@sarahyurick
Collaborator Author

@sarahyurick I think there's a place in the user guide under images/gettingstarted.rst that has the cython install instructions too.

Yes, I have updated docs/user-guide/image/gettingstarted.rst. Let me know if there's another one somewhere.

Collaborator

@ryantwolf ryantwolf left a comment

Oops, sorry, I missed it. I'm used to the user guide being the first thing in the sidebar. Looks good.

@sarahyurick sarahyurick merged commit 363a66b into NVIDIA:main Nov 15, 2024
3 checks passed
davzoku pushed a commit to davzoku/NeMo-Curator that referenced this pull request Nov 19, 2024
* add packaging

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to requires

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to github ci file

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add pin

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add torch

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try github install

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add comma

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* another attempt

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add datasets

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try removing cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* sentencepiece

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>
VibhuJawa pushed a commit that referenced this pull request Nov 19, 2024
* update obsolete flag

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Improve caching (#352)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on main (#354)

* ci: Run gpuci on main
* fix checkout

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on merge commit (#355)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Add conda env to `$PATH` (#357)

* build: Add conda env to `$PATH`

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* test

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* add newline

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* run cleanup always

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Add `build-test-publish-wheel` CI file (#356)

* Create build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Create package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* remove extra version string

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* add `__all__`

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Fix version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Ko3n1g/sarahyurick/ci/build test publish wheel (#358)

* fix

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

* fix

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run isort

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update pyproject.toml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken TestPyPi builder (#362)

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* chore: Add `CHANGELOG.md` file (#359)

* chore: Add `CHANGELOG.md` file

* fix

* add end of line

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Release workflow (#360)

* add file

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* trailing whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow to allow of `devN` semver (#366)

* ci: Bump release workflow for `devN`

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add code-freeze workflow (#367)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add cherry pick workflow (#368)

* ci: Add cherry pick workflow

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken NeMo dependencies (#372)

* add packaging

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to requires

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to github ci file

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add pin

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add torch

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try github install

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add comma

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* another attempt

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add datasets

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try removing cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* sentencepiece

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow (#373)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Skip reading files with incorrect extension (#318)

* filter_files_by_extension function

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add type checking

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* isort

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* address ayush's comments

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* trailing whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* more whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* address praateek's review

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* praateek's review

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* remove deprecated convert_str_ids args  from ConnectedComponents

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

---------

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Vinay Raman <viraman@nvidia.com>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Vinay Raman <viraman@nvidia.com>
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Signed-off-by: Rucha Apte <ruchaa@nvidia.com>
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
* update obsolete flag

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Improve caching (NVIDIA#352)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on main (NVIDIA#354)

* ci: Run gpuci on main
* fix checkout

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on merge commit (NVIDIA#355)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Add conda env to `$PATH` (NVIDIA#357)

* build: Add conda env to `$PATH`

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* test

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* add newline

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* run cleanup always

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Add `build-test-publish-wheel` CI file (NVIDIA#356)

* Create build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Create package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* remove extra version string

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* add `__all__`

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Fix version

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Ko3n1g/sarahyurick/ci/build test publish wheel (NVIDIA#358)

* fix

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

* fix

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run isort

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update pyproject.toml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken TestPyPi builder (NVIDIA#362)

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* chore: Add `CHANGELOG.md` file (NVIDIA#359)

* chore: Add `CHANGELOG.md` file

* fix

* add end of line

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Release workflow (NVIDIA#360)

* add file

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* trailing whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow to allow `devN` semver (NVIDIA#366)

* ci: Bump release workflow for `devN`

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add code-freeze workflow (NVIDIA#367)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add cherry pick workflow (NVIDIA#368)

* ci: Add cherry pick workflow

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* fix

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken NeMo dependencies (NVIDIA#372)

* add packaging

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to requires

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* move to github ci file

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add pin

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add torch

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try github install

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add comma

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* another attempt

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add datasets

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* try removing cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* remove cython

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* sentencepiece

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow (NVIDIA#373)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Skip reading files with incorrect extension (NVIDIA#318)

* filter_files_by_extension function

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add type checking

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* isort

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* address ayush's comments

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run black

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* trailing whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* more whitespace

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* address praateek's review

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* praateek's review

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* remove deprecated convert_str_ids args from ConnectedComponents

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

---------

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Rucha Apte <ruchaa@nvidia.com>
Labels
gpuci Run GPU CI/CD on PR
4 participants