Fix broken NeMo dependencies #372
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
nemo_curator/filters/code.py
try:
    from nemo.collections.common.tokenizers import SentencePieceTokenizer
except (ImportError, ModuleNotFoundError):
    from .sentencepiece_tokenizer import SentencePieceTokenizer
What do we think about this?
ModuleNotFoundError: No module named 'nemo'
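For reference, `ModuleNotFoundError` is a subclass of `ImportError`, so catching `ImportError` alone would also cover the traceback above. A minimal sketch of the optional-import fallback pattern follows; the helper name `import_first_available` and the placeholder module name are hypothetical and not part of NeMo Curator:

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also catches ModuleNotFoundError
            continue
    raise ImportError(f"None of these modules could be imported: {module_names}")


# ModuleNotFoundError derives from ImportError, so one except clause is enough.
assert issubclass(ModuleNotFoundError, ImportError)

# The first name is a made-up package, so the standard-library module is used.
mod = import_first_available("definitely_not_installed_pkg", "json")
print(mod.__name__)
```

The same shape applies to the diff above: try the NeMo import first, then fall back to a local copy only when NeMo is absent.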
Based on our discussions on Slack, I think we can just transform this class into something like this:
import sentencepiece  # needed for SentencePieceProcessor below


class TokenizerFertilityFilter(DocumentFilter):
    def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
        if path_to_tokenizer is None:
            raise ValueError("Must provide a valid path to a SentencePiece tokenizer")
        # Load the SentencePiece model directly instead of going through NeMo.
        self._tokenizer = sentencepiece.SentencePieceProcessor()
        self._tokenizer.Load(path_to_tokenizer)
        self._threshold = min_char_to_token_ratio
        self._name = "tokenizer_fertility"

    def score_document(self, source):
        # The score is the number of characters per token.
        tokens = self._tokenizer.encode_as_pieces(source)
        num_chars = len(source)
        num_tokens = len(tokens)
        if num_tokens == 0:
            return -1
        return num_chars / num_tokens

    def keep_document(self, score):
        return score >= self._threshold
Then we can just delete the one file you copied over. Lmk what you think.
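To make the scoring rule concrete without needing a tokenizer model file, here is a small sketch of the same characters-per-token logic. `WhitespaceTokenizer` is a hypothetical stand-in for `sentencepiece.SentencePieceProcessor`, and the function names are illustrative only:

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a SentencePiece tokenizer (illustration only)."""

    def encode_as_pieces(self, text):
        return text.split()


def fertility_score(tokenizer, text):
    """Characters per token; -1 when the text yields no tokens."""
    tokens = tokenizer.encode_as_pieces(text)
    if len(tokens) == 0:
        return -1
    return len(text) / len(tokens)


def keep(score, threshold=2.5):
    """Keep documents whose score reaches the threshold."""
    return score >= threshold


tok = WhitespaceTokenizer()
dense = fertility_score(tok, "aaaa bb")   # 7 characters / 2 tokens
sparse = fertility_score(tok, "a b c")    # 5 characters / 3 tokens
empty = fertility_score(tok, "")          # no tokens, sentinel score
print(keep(dense), keep(sparse), keep(empty))
```

Because the empty-document sentinel is -1, such documents always fall below any non-negative threshold and are dropped.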
We should probably run this in batches instead of running it piece by piece per file and returning a single file. We can also probably use crossfit for it (if we want to).
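One possible reading of the batching idea, sketched in plain Python. Everything here is an assumption for illustration: `encode_batch` is a hypothetical callable that maps a list of strings to a list of token lists, and the whitespace splitter stands in for a real tokenizer:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fertility_scores(documents, encode_batch, batch_size=1024):
    """Score documents batch by batch instead of one tokenizer call per document."""
    scores = []
    for batch in batched(documents, batch_size):
        for text, tokens in zip(batch, encode_batch(batch)):
            scores.append(len(text) / len(tokens) if tokens else -1)
    return scores


def whitespace_encode_batch(texts):
    """Hypothetical batch tokenizer used only for this sketch."""
    return [t.split() for t in texts]


docs = ["aaaa bb", "", "a b c"]
scores = fertility_scores(docs, whitespace_encode_batch, batch_size=2)
print(scores)
```

The scores come back in input order, so they can be attached to the original documents as a column.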
This is what that will look like
cf.op.Tokenizer(model, cols=["text"], tokenizer_type="sentencepiece")
That said, we might have to ensure this works in a CPU environment too, so there might be some complexity here that we need to address.
Thanks @VibhuJawa! I have opened #377 to track this.
import torch


class SentencePieceTokenizer:
@@ -54,14 +55,14 @@ dependencies = [
    "lxml_html_clean",
    "mecab-python3",
    "mwparserfromhell==0.6.5",
    "nemo_toolkit[nlp]>=1.23.0",
Opened #376.
We also probably need to add torch as a dependency now. We inherited that from NeMo. Though, not sure if the HF libraries pick that up automatically.
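Whether torch still arrives transitively could be checked in a clean environment after installing the package. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The result depends on the environment this runs in.
print("torch available:", is_installed("torch"))
```

If this reports that torch is missing in a fresh install, it has to be listed explicitly in the dependencies.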
Looking at the logs, it looks like
@sarahyurick I think there's a place in the user guide under
Yes I have updated docs/user-guide/image/gettingstarted.rst - let me know if there's another somewhere.
oops sorry I was blind. I'm used to the user guide being the first thing in the side bar. Looks good.
Commits (each signed off by Sarah Yurick <sarahyurick@gmail.com>):
* add packaging
* move to requires
* move to github ci file
* add pin
* add torch
* add suggestion from mamba readme
* try github install
* add comma
* another attempt
* remove nemo toolkit
* add datasets
* try removing cython
* remove cython
* sentencepiece
* run black
* apply ryan's suggestion
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>
See example failure: https://github.com/NVIDIA/NeMo-Curator/actions/runs/11844241061/job/33007266564?pr=318